Our research focuses on modelling Google Play Store apps to uncover patterns, trends, and insights regarding app characteristics, user behavior, and installation patterns. We examine how app popularity, defined as the number of installs, is affected by the top five categories, last-updated date, app size, version, and other factors.
Specific: The question focuses on identifying how app popularity (defined by the number of installs) is influenced by well-defined variables, including the top five app categories, last update date, app size, version, and additional characteristics such as content rating, pricing model, and user reviews. It also aims to uncover specific patterns and trends in user behavior and installation patterns.
Measurable: The impact of each variable (categories, last update, size, version) on app popularity is quantifiable using metrics such as the number of installs, user reviews, ratings, app size in MB, frequency of updates, and category-specific rankings. This ensures that results can be expressed numerically or statistically.
Achievable: Given the availability of historical data from the Google Play Store (e.g., datasets spanning years and including app attributes), the analysis is feasible using data analysis techniques, statistical modeling, and machine learning. Open-source libraries and tools can efficiently handle the data preprocessing and modeling.
Relevant: The research is pertinent to app developers, marketers, and stakeholders in the mobile app ecosystem. Understanding the factors driving app installs directly addresses key industry challenges, such as improving app visibility, optimizing user engagement, and tailoring marketing strategies for success.
Time-specific: The research will use data from a specific timeframe (e.g., 2010-2018), ensuring that insights are grounded in a defined historical context. The results could also include temporal trends to observe how factors influencing popularity have evolved over time.
This research aims to analyze Google Play Store apps to uncover patterns, trends, and insights into how app characteristics influence popularity, defined by the number of installs. The study will involve systematic steps, including data cleaning, exploratory data analysis (EDA), modeling, and evaluation, to address the SMART research questions.
Here, we load the ‘Google Play Store Apps’ dataset, which is stored in a CSV file, using read.csv().
#Loading the Dataset
data_apps <- data.frame(read.csv("googleplaystore.csv"))
#Checking the structure of the data
str(data_apps)
## 'data.frame': 10841 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : chr "159" "967" "87510" "215644" ...
## $ Size : chr "19M" "14M" "8.7M" "25M" ...
## $ Installs : chr "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : chr "0" "0" "0" "0" ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
#First 6 rows of the dataset (the default for head())
head(data_apps)
## App Category Rating
## 1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1
## 2 Coloring book moana ART_AND_DESIGN 3.9
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN 4.7
## 4 Sketch - Draw & Paint ART_AND_DESIGN 4.5
## 5 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3
## 6 Paper flowers instructions ART_AND_DESIGN 4.4
## Reviews Size Installs Type Price Content.Rating Genres
## 1 159 19M 10,000+ Free 0 Everyone Art & Design
## 2 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play
## 3 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design
## 4 215644 25M 50,000,000+ Free 0 Teen Art & Design
## 5 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity
## 6 167 5.6M 50,000+ Free 0 Everyone Art & Design
## Last.Updated Current.Ver Android.Ver
## 1 January 7, 2018 1.0.0 4.0.3 and up
## 2 January 15, 2018 2.0.0 4.0.3 and up
## 3 August 1, 2018 1.2.4 4.0.3 and up
## 4 June 8, 2018 Varies with device 4.2 and up
## 5 June 20, 2018 1.1 4.4 and up
## 6 March 26, 2017 1.0 2.3 and up
# Checking the type of the App
typeof(data_apps$App)
## [1] "character"
#Display all the duplicated Apps
duplicate_apps <- aggregate(App ~ ., data = data_apps, FUN = length)
duplicate_apps <- duplicate_apps[duplicate_apps$App > 1, ]
duplicate_apps <- duplicate_apps[order(-duplicate_apps$App), ]
#View(duplicate_apps)
#print(duplicate_apps)
print(paste("Number of duplicated Apps:",nrow(duplicate_apps)))
## [1] "Number of duplicated Apps: 404"
#Removing Na values and duplicates
data_clean <- data_apps[!is.na(data_apps$App), ]
data_clean <- data_clean[!duplicated(data_clean$App), ]
#(After removing the duplicates) Unique values
unique_apps <- length(unique(data_clean$App))
print(paste("Number of unique apps after removing the duplicates:", unique_apps))
## [1] "Number of unique apps after removing the duplicates: 9660"
Duplicate App Analysis:
str(data_clean$App)
## chr [1:9660] "Photo Editor & Candy Camera & Grid & ScrapBook" ...
typeof(data_apps$Price)
## [1] "character"
The prices of paid apps include a ‘$’ symbol, which must be removed before the column can be converted to numeric.
#To check if there is dollar symbol present
#data_clean$Price[]
# Remove dollar symbols and convert to numeric
data_clean$Price <- as.numeric(gsub("\\$", "", data_clean$Price))
#Recheck for dollar symbol
#data_clean$Price[]
All the dollar symbols have been removed successfully.
# Summary statistics for price
summary(data_clean$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 1.099 0.000 400.000 1
The summary above shows one missing value (NA) in the Price column. Let’s handle it!
missing_na <- is.na(data_clean$Price)
missing_blank <- data_clean$Price == ""
sum(missing_na)
## [1] 1
sum(missing_blank, na.rm = TRUE)
## [1] 0
# Remove row where Price is NA or blank
data_clean <- data_clean[!is.na(data_clean$Price) & data_clean$Price != "", ]
One row (#10473) has been removed: the app has no category name, so it is not relevant to our analysis.
#Recheck for missing values
summary(data_clean$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.099 0.000 400.000
#Checking the type of Type variable
table(data_clean$Type)
##
## Free Paid
## 8902 756
The Price column contains 8903 apps priced at 0, but the Type column lists only 8902 as Free, so one value in Type is inconsistent. Let’s check!
#Checking for Missing values
print(paste("Missing values:",sum(is.na(data_clean$Type))))
## [1] "Missing values: 0"
data_clean[is.na(data_clean$Type), ]
## [1] App Category Rating Reviews Size
## [6] Installs Type Price Content.Rating Genres
## [11] Last.Updated Current.Ver Android.Ver
## <0 rows> (or 0-length row.names)
# Replace NaN or missing values in the Type column with "Free"
data_clean$Type[is.na(data_clean$Type)] <- "Free"
Row 9150 has a missing value for Type. Because its price is 0, it is replaced with “Free”.
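Since `is.na()` reports zero missing values above, the missing Type may be stored as the literal string "NaN" rather than as a true NA. This is an assumption, suggested by the "NaN" level that still appears in the later summary by Type. A minimal sketch on a hypothetical toy data frame (`toy` is not part of the original analysis) shows one way to catch both cases:

```r
# Toy data: the third Type is the literal string "NaN", not a real NA
toy <- data.frame(
  Type  = c("Free", "Paid", "NaN", "Free"),
  Price = c(0, 2.99, 0, 0),
  stringsAsFactors = FALSE
)

# is.na() does not flag the string "NaN"
n_true_na <- sum(is.na(toy$Type))

# Flag true NAs as well as literal "NaN" or empty strings
type_missing <- is.na(toy$Type) | toy$Type %in% c("NaN", "")

# Derive the label from Price for the flagged rows only
toy$Type[type_missing] <- ifelse(toy$Price[type_missing] == 0, "Free", "Paid")
print(table(toy$Type))
```

Applied to the real data, the same mask would be built on `data_clean$Type`; whether the stored value is exactly "NaN" should be confirmed with `unique(data_clean$Type)` first.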
# Checking the type of the Size
typeof(data_apps$Size)
## [1] "character"
# Replace "Varies with Device" in the Size column with NA
data_clean$Size[data_clean$Size == "Varies with device"] <- NA
data_clean <- data_clean[!grepl("\\+", data_clean$Size), ]
# Convert sizes to megabytes: "k" values are multiplied by 0.001, "M" values only lose the suffix
data_clean$Size <- ifelse(grepl("k", data_clean$Size),
as.numeric(gsub("k", "", data_clean$Size)) * 0.001,
as.numeric(gsub("M", "", data_clean$Size)))
# Calculate and display the mean size for each Category
mean_size_by_type <- tapply(data_clean$Size, data_clean$Category,
mean, na.rm = TRUE)
print(mean_size_by_type)
## ART_AND_DESIGN AUTO_AND_VEHICLES BEAUTY BOOKS_AND_REFERENCE
## 12.370968 20.037147 13.795745 13.134701
## BUSINESS COMICS COMMUNICATION DATING
## 13.867194 13.794959 11.307430 15.661119
## EDUCATION ENTERTAINMENT EVENTS FAMILY
## 19.057101 23.043750 13.963754 27.187988
## FINANCE FOOD_AND_DRINK GAME HEALTH_AND_FITNESS
## 17.368127 20.494318 41.866609 20.669707
## HOUSE_AND_HOME LIBRARIES_AND_DEMO LIFESTYLE MAPS_AND_NAVIGATION
## 15.970258 10.602883 14.844916 16.368121
## MEDICAL NEWS_AND_MAGAZINES PARENTING PERSONALIZATION
## 19.189399 12.470189 22.512963 11.224624
## PHOTOGRAPHY PRODUCTIVITY SHOPPING SOCIAL
## 15.666158 12.342505 15.491435 15.984090
## SPORTS TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS
## 24.058361 8.782837 24.204410 15.792756
## WEATHER
## 12.680036
# Replace NA values in the Size column with the mean size of the corresponding category (vectorised, no explicit loop)
data_clean$Size <- ifelse(
is.na(data_clean$Size), # Check if Size is NA
round(mean_size_by_type[data_clean$Category], 1), # Replace with the mean size based on the Category
data_clean$Size # Keep the original size if it's not NA
)
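The category-mean imputation above can also be written with base R's `ave()`, which returns the group mean aligned to each row. The following is a sketch on a hypothetical toy data frame, not the code used in the analysis:

```r
# Toy data with one missing Size in each category
toy <- data.frame(
  Category = c("GAME", "GAME", "GAME", "TOOLS", "TOOLS"),
  Size     = c(40, NA, 44, 8, NA)
)

# ave() returns, for every row, the mean Size of that row's Category
category_mean <- ave(toy$Size, toy$Category,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed sizes; fill only the missing ones with the rounded category mean
toy$Size <- ifelse(is.na(toy$Size), round(category_mean, 1), toy$Size)
print(toy)
```

Because the means are aligned row by row, no lookup by category name is needed.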
####Remove the ‘+’ sign, Remove the commas, Convert to numeric
#clean installations
clean_installs <- function(Installs) {
Installs <- gsub("\\+", "", Installs)
Installs <- gsub(",", "", Installs)
return(as.numeric(Installs))
}
data_clean$Installs <- sapply(data_clean$Installs, clean_installs)
nan_rows <- sapply(data_clean[, c("Size", "Installs")], function(x) any(is.nan(x)))
# nan_rows flags columns, not rows: this selects any flagged column (none contain NaN)
data_clean[,nan_rows]
## data frame with 0 columns and 9659 rows
datatable((data_clean), options = list(scrollX = TRUE ))
data_clean <- data_clean %>%
mutate(Rating = ifelse(is.na(Rating), mean(Rating, na.rm = TRUE), Rating))
# Identify the unique values in the 'Installs' column
unique_values <- unique(data_clean$Installs)
# Display the unique values
print(unique_values)
## [1] 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 1e+06 1e+07 5e+03 1e+08 1e+09 1e+03
## [13] 5e+08 5e+01 1e+02 5e+02 1e+01 1e+00 5e+00 0e+00
# Function to convert the installs to numeric
convert_to_numeric <- function(x) {
# Numeric input is returned unchanged; strings such as "10,000+" lose "+" and "," first
if (is.numeric(x)) return(x)
as.numeric(gsub("[+,]", "", x))
}
# Sort unique values based on the numeric conversion
sorted_values <- unique_values[order(sapply(unique_values, convert_to_numeric))]
# Checking the type of the Rating
typeof(data_clean$Rating)
## [1] "double"
# Checking the type of the Reviews
typeof(data_clean$Reviews)
## [1] "character"
## chr [1:9659] "159" "967" "87510" "215644" "967" "167" "178" "36815" ...
## num [1:9659] 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
As we can see, the Reviews column is stored as strings; converting it to a numeric type allows further analysis.
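The conversion step itself is not shown in the report. A minimal sketch, assuming the Reviews strings contain only digits apart from occasional malformed entries (the value "3.0M" below is a hypothetical example of such an entry):

```r
# Toy vector of review counts stored as text; the last entry is malformed
reviews_chr <- c("159", "967", "87510", "3.0M")

# as.numeric() turns unparseable strings into NA (with a warning, silenced here)
reviews_num <- suppressWarnings(as.numeric(reviews_chr))

# Inspect the entries that failed to parse before deciding how to handle them
bad <- reviews_chr[is.na(reviews_num)]
print(bad)
```

On the real data the equivalent call would be `data_clean$Reviews <- as.numeric(data_clean$Reviews)`.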
## 'data.frame': 9659 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : num 159 967 87510 215644 967 ...
## $ Size : num 19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
## $ Installs : num 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content.Rating | Genres | Last.Updated | Current.Ver | Android.Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Length:9659 | Length:9659 | Min. :1.000 | Min. : 0 | Min. : 0.0085 | Min. :0.000e+00 | Length:9659 | Min. : 0.000 | Length:9659 | Length:9659 | Length:9659 | Length:9659 | Length:9659 |
| Q1 | Class :character | Class :character | 1st Qu.:4.000 | 1st Qu.: 25 | 1st Qu.: 5.3000 | 1st Qu.:1.000e+03 | Class :character | 1st Qu.: 0.000 | Class :character | Class :character | Class :character | Class :character | Class :character |
| Median | Mode :character | Mode :character | Median :4.200 | Median : 967 | Median : 13.1000 | Median :1.000e+05 | Mode :character | Median : 0.000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character |
| Mean | NA | NA | Mean :4.173 | Mean : 216593 | Mean : 20.1512 | Mean :7.778e+06 | NA | Mean : 1.099 | NA | NA | NA | NA | NA |
| Q3 | NA | NA | 3rd Qu.:4.500 | 3rd Qu.: 29401 | 3rd Qu.: 27.0000 | 3rd Qu.:1.000e+06 | NA | 3rd Qu.: 0.000 | NA | NA | NA | NA | NA |
| Max | NA | NA | Max. :5.000 | Max. :78158306 | Max. :100.0000 | Max. :1.000e+09 | NA | Max. :400.000 | NA | NA | NA | NA | NA |
There are 1463 missing values in rating.
As can be observed, the Family category has the most NA values. Let’s not drop these rows but handle them by replacing the missing ratings with the overall mean rating.
#Replace NA in Ratings with Overall Mean
data_clean <- data_clean %>%
mutate(Rating = ifelse(is.na(Rating), mean(Rating, na.rm = TRUE), Rating))
xkablesummary(data_clean)
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content.Rating | Genres | Last.Updated | Current.Ver | Android.Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Length:9659 | Length:9659 | Min. :1.000 | Min. : 0 | Min. : 0.0085 | Min. :0.000e+00 | Length:9659 | Min. : 0.000 | Length:9659 | Length:9659 | Length:9659 | Length:9659 | Length:9659 |
| Q1 | Class :character | Class :character | 1st Qu.:4.000 | 1st Qu.: 25 | 1st Qu.: 5.3000 | 1st Qu.:1.000e+03 | Class :character | 1st Qu.: 0.000 | Class :character | Class :character | Class :character | Class :character | Class :character |
| Median | Mode :character | Mode :character | Median :4.200 | Median : 967 | Median : 13.1000 | Median :1.000e+05 | Mode :character | Median : 0.000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character |
| Mean | NA | NA | Mean :4.173 | Mean : 216593 | Mean : 20.1512 | Mean :7.778e+06 | NA | Mean : 1.099 | NA | NA | NA | NA | NA |
| Q3 | NA | NA | 3rd Qu.:4.500 | 3rd Qu.: 29401 | 3rd Qu.: 27.0000 | 3rd Qu.:1.000e+06 | NA | 3rd Qu.: 0.000 | NA | NA | NA | NA | NA |
| Max | NA | NA | Max. :5.000 | Max. :78158306 | Max. :100.0000 | Max. :1.000e+09 | NA | Max. :400.000 | NA | NA | NA | NA | NA |
Now there are no missing values in Rating.
breaks = seq(15,20,by = 1)
frequency_table = table(data_clean$Rating)
frequency_table
##
## 1 1.2 1.4 1.5
## 16 1 3 3
## 1.6 1.7 1.8 1.9
## 4 8 8 11
## 2 2.1 2.2 2.3
## 12 8 14 20
## 2.4 2.5 2.6 2.7
## 19 20 24 23
## 2.8 2.9 3 3.1
## 40 45 81 69
## 3.2 3.3 3.4 3.5
## 63 100 126 156
## 3.6 3.7 3.8 3.9
## 167 224 286 359
## 4 4.1 4.17324304538799 4.2
## 513 621 1463 810
## 4.3 4.4 4.5 4.6
## 897 895 848 683
## 4.7 4.8 4.9 5
## 442 221 85 271
From the table above, all the ratings lie between 1 and 5.
# Checking the type of the Category
typeof(data_apps$Category)
## [1] "character"
length(unique(data_clean$Category))
## [1] 33
length(unique(data_clean$Genres))
## [1] 118
There are 33 categories in the data frame and 118 genres, which means that each category contains multiple genres. Given that, the later analyses in this project proceed with the Category variable.
Below is the graph for the distribution of Categories for the dataset after removing duplicates.
Due to the inconsistent formatting of values in the
Current.Ver column, this column is dropped and will be
excluded from the analysis.
data_final <- data_clean %>% select(-c('Genres', 'Current.Ver'))
data_final$Category <- factor(data_final$Category)
data_final$Android.Ver <- factor(data_final$Android.Ver)
# Remove leading and trailing spaces and convert all text to a consistent format
data_final$Content.Rating <- trimws(tolower(data_final$Content.Rating))
cr_missing <- sum(is.na(data_final$Content.Rating))
print(paste("Number of missing values in 'Content Rating':", cr_missing))
## [1] "Number of missing values in 'Content Rating': 0"
There are no missing values for Content rating.
# Convert Last Updated to Date format
data_final$Last.Updated <- as.Date(data_final$Last.Updated, format = "%B %d, %Y")
# Verify the cleaning (on data_final, where the column was converted)
print("Summary of Last.Updated after cleaning:")
## [1] "Summary of Last.Updated after cleaning:"
print(summary(data_final$Last.Updated))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2010-05-21 2017-08-05 2018-05-04 2017-10-30 2018-07-17 2018-08-08
str(data_final)
## 'data.frame': 9659 obs. of 11 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : Factor w/ 33 levels "ART_AND_DESIGN",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : num 159 967 87510 215644 967 ...
## $ Size : num 19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
## $ Installs : num 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Content.Rating: chr "everyone" "everyone" "everyone" "teen" ...
## $ Last.Updated : Date, format: "2018-01-07" "2018-01-15" ...
## $ Android.Ver : Factor w/ 34 levels "1.0 and up","1.5 and up",..: 16 16 16 19 21 9 16 19 11 16 ...
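Once Last.Updated is a Date, it can be turned into a numeric feature for the later modelling of installs. The sketch below derives the number of days since the last update relative to the most recent date in the data. The feature name `days_since_update` is hypothetical and not part of the original analysis, and `%B` assumes an English locale for month names.

```r
# Toy dates in the same text format as the dataset
last_updated_chr <- c("January 7, 2018", "August 1, 2018", "March 26, 2017")

# Parse "Month day, Year"; %B matches the full month name in the current locale
last_updated <- as.Date(last_updated_chr, format = "%B %d, %Y")

# Days elapsed between each app's last update and the newest update in the data
days_since_update <- as.numeric(max(last_updated) - last_updated)
print(days_since_update)
```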
# Count Plot for the Price distribution
ggplot(data_final, aes(x=Price)) +
geom_histogram(binwidth=2, fill="pink", color="black") +
xlim(0, 500) + ylim(0, 500) +
labs(title="Price Distribution", x="Price", y="Frequency") +
theme_minimal()
The data is highly skewed as there are many zero price entries.
# Boxplot for the same
ggplot(data_final, aes(y=Price)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 1, fill="pink", color="black") +
labs(title="Price Boxplot", y="Price") +
theme_minimal()
outlierKD2 <- function(df, var, rm = FALSE, boxplt = FALSE, histogram = TRUE, qqplt = FALSE) {
dt <- df # Duplicate the dataframe for potential alteration
var_name <- eval(substitute(var), eval(dt))
na1 <- sum(is.na(var_name))
m1 <- mean(var_name, na.rm = TRUE)
colTotal <- boxplt + histogram + qqplt # Calculate the total number of charts to be displayed
par(mfrow = c(2, max(2, colTotal)), oma = c(0, 0, 3, 0)) # Adjust layout for plots
# Q-Q plot of the original data (outliers included)
if (qqplt) {
qqnorm(var_name, main="Q-Q plot with Outliers")
qqline(var_name)
}
# Histogram of the original data (outliers included)
if (histogram) {
hist(var_name,main = "Histogram with Outliers", xlab = NA, ylab = NA)
}
# Box plot of the original data (outliers included)
if (boxplt) {
boxplot(var_name, main= "Box Plot with Outliers")
}
# Identify outliers and set them to NA
outlier <- boxplot.stats(var_name)$out
mo <- mean(outlier)
var_name <- ifelse(var_name %in% outlier, NA, var_name)
# Q-Q plot without outliers
if (qqplt) {
qqnorm(var_name, main="Q-Q plot without Outliers")
qqline(var_name)
}
# Histogram without outliers
if (histogram) {
hist(var_name, main = "Histogram without Outliers", xlab = NA, ylab = NA)
}
# Box plot without outliers
if (boxplt) {
boxplot(var_name, main = "Box Plot without Outliers")
}
# Add the title for the overall plot section if any plots are displayed
if (colTotal > 0) {
title("Outlier Check", outer = TRUE)
na2 <- sum(is.na(var_name))
cat("Outliers identified:", na2 - na1, "\n")
cat("Proportion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name)) * 100, 1), "\n")
cat("Mean of the outliers:", round(mo, 2), "\n")
cat("Mean without removing outliers:", round(m1, 2), "\n")
cat("Mean if we remove outliers:", round(mean(var_name, na.rm = TRUE), 2), "\n")
}
}
#The outlier function is defined in the previous chunk of code.
outlier_check_price = outlierKD2(data_final, Price, rm = FALSE, boxplt = TRUE, qqplt = TRUE)
## Outliers identified: 756
## Proportion (%) of outliers: 8.5
## Mean of the outliers: 14.05
## Mean without removing outliers: 1.1
## Mean if we remove outliers: 0
The price values in the dataset, including both typical and extreme values, are valid observations for our analysis. As such, removing these outliers may not be beneficial for our study.
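The count of 756 outliers equals the number of paid apps, and the mean after removal is 0. This follows from how `boxplot.stats()` works: it flags values more than 1.5 × IQR beyond the hinges, and when most prices are 0 both hinges are 0, so every non-zero price is flagged. A sketch on a hypothetical toy price vector:

```r
# Toy prices: 17 free apps and 3 paid apps
prices <- c(rep(0, 17), 0.99, 2.99, 399.99)

stats <- boxplot.stats(prices)

# Lower hinge, median and upper hinge are all 0, so the IQR is 0
hinges <- stats$stats[c(2, 4)]

# Every price above 0 lies outside [0 - 1.5 * 0, 0 + 1.5 * 0] and is flagged
flagged <- stats$out
print(flagged)
```

In such a zero-inflated variable the boxplot rule effectively separates paid from free apps rather than identifying implausible values, which supports keeping these observations.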
#To check the value ranges
table(data_final$Price)
##
## 0 0.99 1 1.04 1.2 1.26 1.29 1.49 1.5 1.59 1.61
## 8903 145 3 1 1 1 1 46 1 1 1
## 1.7 1.75 1.76 1.96 1.97 1.99 2 2.49 2.5 2.56 2.59
## 2 1 1 1 1 73 3 25 1 1 1
## 2.6 2.9 2.95 2.99 3.02 3.04 3.08 3.28 3.49 3.61 3.88
## 1 1 1 124 1 1 1 1 7 1 1
## 3.9 3.95 3.99 4.29 4.49 4.59 4.6 4.77 4.8 4.84 4.85
## 1 1 57 1 9 1 1 1 1 1 1
## 4.99 5 5.49 5.99 6.49 6.99 7.49 7.99 8.49 8.99 9
## 70 1 5 26 5 11 2 7 2 5 1
## 9.99 10 10.99 11.99 12.99 13.99 14 14.99 15.46 15.99 16.99
## 19 2 2 3 4 2 1 9 1 1 2
## 17.99 18.99 19.4 19.9 19.99 24.99 25.99 28.99 29.99 30.99 33.99
## 2 1 1 1 5 3 1 1 5 1 1
## 37.99 39.99 46.99 74.99 79.99 89.99 109.99 154.99 200 299.99 379.99
## 1 2 1 1 1 1 1 1 1 1 1
## 389.99 394.99 399.99 400
## 1 1 12 1
As already mentioned, there are 8903 free apps (apps with a price of 0), far more than any other price point.
# Bar Plot for the Type Distribution
ggplot(data_final, aes(x = Type)) +
geom_bar(fill = "pink", color = "black") +
labs(title = "Distribution of App Types (Free vs Paid)", x = "Type", y = "Count") +
theme_minimal()
As it is clear, there are more free apps.
#Display statistics for the Price of apps grouped by their Type
data_final$Type <- as.factor(data_final$Type)
summary_by_type <- data.frame(
Type = levels(data_final$Type),
Min_Price = tapply(data_clean$Price, data_clean$Type, min, na.rm = TRUE),
Max_Price = tapply(data_clean$Price, data_clean$Type, max, na.rm = TRUE),
Mean_Price = tapply(data_clean$Price, data_clean$Type, mean, na.rm = TRUE),
Median_Price = tapply(data_clean$Price, data_clean$Type, median, na.rm = TRUE)
)
print(summary_by_type)
## Type Min_Price Max_Price Mean_Price Median_Price
## Free Free 0.00 0 0.00000 0.00
## NaN NaN 0.00 0 0.00000 0.00
## Paid Paid 0.99 400 14.04515 2.99
#Box plot for price distribution by app type
ggplot(data_final, aes(x = Type, y = Price, fill = Type)) +
geom_boxplot() +
labs(title = "Price Distribution by App Type",
x = "App Type",
y = "Price ($)") +
theme_minimal()
ggplot(data_final, aes(x = Price, fill = Type)) +
geom_histogram(binwidth = 60, alpha = 0.7, position = "identity") +
facet_wrap(~ Type) +
labs(title = "Price Distribution by App Type",
x = "Price ($)",
y = "Count") +
theme_minimal()
Upon analyzing the price distribution across app types, we found that some values in the Type column do not accurately reflect the app prices (see the plot above). Since we can rely fully on the Price values for our analysis, the Type column seems unnecessary.
Hence, Removing the Type column…
#Using subset function
data_final <- subset(data_final, select = -Type)
#After removing the Type column and duplicated values
str(data_final)
## 'data.frame': 9659 obs. of 10 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : Factor w/ 33 levels "ART_AND_DESIGN",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : num 159 967 87510 215644 967 ...
## $ Size : num 19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
## $ Installs : num 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Content.Rating: chr "everyone" "everyone" "everyone" "teen" ...
## $ Last.Updated : Date, format: "2018-01-07" "2018-01-15" ...
## $ Android.Ver : Factor w/ 34 levels "1.0 and up","1.5 and up",..: 16 16 16 19 21 9 16 19 11 16 ...
Let’s do bivariate analysis on price and other variables starting from here.
#Plotting a scatter plot between Price and installs
ggplot(data_final, aes(x = Price, y = log(Installs))) +
geom_point(color = 'red', size = 1, alpha = 0.5) +
geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
labs(title = "Price vs Installs", x = "Price (USD)", y = "Log(Installs)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
From the scatter plot, we can see that apps priced at 0 have the highest numbers of installs.
# Categorize the apps as "Free" or "Paid" based on Price
Price_Category <- ifelse(data_final$Price == 0, "Free", "Paid")
str(data_final$Price)
## num [1:9659] 0 0 0 0 0 0 0 0 0 0 ...
str(Price_Category)
## chr [1:9659] "Free" "Free" "Free" "Free" "Free" "Free" "Free" "Free" ...
#str(log(data_clean$Installs))
For a better visualization, we label apps with a price of 0 as free and all others as paid, and draw a box plot.
# Box plot of Price Category vs. log-transformed Installs
ggplot(data_final, aes(x = Price_Category, y = log(Installs))) +
geom_boxplot(fill = "lightblue", color = "darkblue", alpha = 0.6) +
labs(title = "Price Categories vs. Log-Transformed Installs",
x = "Price Category",
y = "Log(Installs)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
“Free” apps tend to have more installs than “Paid” apps. The difference between the means on the log scale is estimated to be between 3.47 and 3.97.
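The interval quoted above is the kind of result a two-sample t-test on log installs produces, although the test itself is not shown in the report. The sketch below uses simulated data (all values are hypothetical), and it assumes Welch's t-test and `log1p()` so that apps with zero installs stay finite; the original analysis may have used a different test or transformation.

```r
set.seed(42)

# Simulated install counts: free apps drawn from a higher range than paid apps
toy <- data.frame(
  Price_Category = rep(c("Free", "Paid"), times = c(200, 60)),
  Installs = c(round(10^runif(200, 2, 7)), round(10^runif(60, 0, 5)))
)

# Welch two-sample t-test on log(1 + Installs); the difference is Free minus Paid
tt <- t.test(log1p(Installs) ~ Price_Category, data = toy)

# 95% confidence interval for the difference in mean log installs
ci <- tt$conf.int
print(ci)
```

An interval that lies entirely above 0 indicates that free apps have higher mean log installs than paid apps.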
table(Price_Category)
## Price_Category
## Free Paid
## 8903 756
# Add Price_Category to data_final
data_duplicate <- data_final
data_duplicate$Price_Category <- ifelse(data_final$Price == 0, "Free", "Paid")
# Create a summarized table for Price_Category and log_Installs
summary_table <- data_duplicate %>%
group_by(Price_Category) %>%
summarise(Average_Log_Installs = mean(log(data_clean$Installs), na.rm = TRUE),
Count = n())
# View the summarized table
kable(summary_table, format = "html", col.names = c("Price Category", "Mean Log(Installs)", "App Count")) %>%
kable_styling(full_width = FALSE, position = "center")
| Price Category | Mean Log(Installs) | App Count |
|---|---|---|
| Free | -Inf | 8903 |
| Paid | -Inf | 756 |
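Both groups show -Inf for two reasons: `log(0)` is -Inf for apps with zero installs, and `data_clean$Installs` inside `summarise()` refers to the whole column rather than to the rows of each group. A sketch of a group-wise mean on hypothetical toy data, using `log1p()` (an assumption; filtering out zero-install apps would be another option):

```r
# Toy data including one app with zero installs
toy <- data.frame(
  Price    = c(0, 0, 0, 4.99, 0.99),
  Installs = c(1000, 0, 100000, 500, 10)
)
toy$Price_Category <- ifelse(toy$Price == 0, "Free", "Paid")

# log1p(x) = log(1 + x) is finite at 0; tapply() computes the mean per group
mean_log <- tapply(log1p(toy$Installs), toy$Price_Category, mean)
counts <- table(toy$Price_Category)

summary_toy <- data.frame(
  Price_Category      = names(mean_log),
  Mean_Log1p_Installs = as.numeric(mean_log),
  Count               = as.integer(counts[names(mean_log)])
)
print(summary_toy)
```

In the dplyr pipeline the equivalent change would be to write `mean(log1p(Installs))` inside `summarise()`, so that the column is taken from the grouped data.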
# Plot Price vs. Reviews
ggplot(data_final, aes(x=Price, y=Reviews)) +
geom_point(color = 'blue') +
geom_smooth(method = 'lm', color = 'red', se = FALSE) +
labs(title = "Price vs Reviews", x = "Price (USD)", y = "Number of Reviews") +
theme_minimal() +
theme(
panel.background = element_rect(fill = "white"), # Set panel background to white
plot.background = element_rect(fill = "white"), # Set plot background to white
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Plot Price vs. Rating
ggplot(data_final, aes(x=Price, y=Rating)) +
geom_point(color = 'green') +
geom_smooth(method = 'lm', color = 'red', se = FALSE) +
labs(title = "Price vs Rating", x = "Price (USD)", y = "Rating") +
theme_minimal() +
theme(
panel.background = element_rect(fill = "white"), # Set panel background to white
plot.background = element_rect(fill = "white"), # Set plot background to white
axis.text.x = element_text(angle = 45, hjust = 1)
)
Price vs Reviews with installation: Cheaper products tend to have more reviews, indicating higher popularity or more frequent purchases. In contrast, expensive products tend to have fewer reviews, possibly because fewer people buy higher-priced items.
Price vs Ratings with installation: Price does not strongly affect the average rating, but there is a slight trend: lower-priced apps show more variation in ratings, while higher-priced apps tend to receive more consistent ratings of around 4. Higher-priced apps may be meeting customer expectations.
# Scatter plot of Price vs. Ratings with log_Installs as color
ggplot(data_final, aes(x = Price, y = Rating,color = log(data_clean$Installs))) +
geom_point(alpha = 0.6) +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Price vs. Ratings with Installs as Color by Price",
x = "Price",
y = "Rating",
color = "log(Installs)") +
theme_minimal()
# Scatter plot of Price vs. Reviews with log_Installs as color
ggplot(data_final, aes(x = Price, y = Reviews,color = log(data_clean$Installs))) +
geom_point(alpha = 0.6) +
scale_color_gradient(low = "darkgreen", high = "yellow") +
labs(title = "Price vs. Reviews with Installs as Color",
x = "Price",
y = "Reviews",
color = "log(Installs)") +
theme_minimal()
In conclusion, lower-priced apps have more ratings and installs, while higher-priced apps tend to have fewer installs and more scattered ratings. The same holds for reviews.
# Plot Price vs Size
ggplot(data_final, aes(x=Price, y=Size)) +
geom_point(color = 'red') +
geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
labs(title = "Price vs Size", x = "Price (USD)", y = "App Size (MB)") +
theme_minimal()
# Create a new data frame to store the factor levels
data_clean1_factor <- data_final # Assuming you want to keep the original data intact
data_clean1_factor$Installs <- factor(data_final$Installs, levels = sorted_values)
# Define new breaks for more even intervals for Installs
install_breaks <- c(0, 500, 1000, 5000, 10000, 50000, 100000, 300000, 1000000, 5000000, 10000000, Inf)
# Create a categorical variable for installs based on these breaks
data_clean1_factor$Installs_Category <- cut(
as.numeric(as.character(data_final$Installs)),
breaks = install_breaks,
right = FALSE,
labels = c("0+", "500+", "1K+", "5K+", "10K+", "50K+", "100K+", "300K+", "1M+", "5M+","Above 10M+")
)
# Plot the categorized Installs data
library(ggplot2)
ggplot(data_clean1_factor, aes(x = Installs_Category)) +
geom_bar() +
xlab("Installs") +
ylab("Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Distribution of App Installs by Category")
#### Installs vs Size
ggplot(data_clean, aes(x = Size, y = log(Installs))) +
geom_hex(bins = 30) +
scale_fill_viridis_c() + # Adds color gradient
labs(title = "Plot of App Size vs. Installs (Log Scale)",
x = "Size (MB)",
y = "Installs (Log Scale)") +
theme_minimal()
boxplot(data_final$Rating,ylab = "Rating", xlab = "Count",col = "Blue")
hist(data_clean$Rating, main="Histogram of Apps Rating after cleaning", xlab="Rating (count)", col = 'blue', breaks = 100 )
qqnorm(data_clean$Rating)
qqline(data_clean$Rating, col = "red")
Here, the plots are much clearer but still skewed because of the outliers in the 1-3 rating range. These low ratings may help explain why some apps are poorly rated, so they cannot be removed from our dataset.
boxplot(data_final$Reviews,ylab = "Reviews", xlab = "Count",col = 'Blue')
hist(data_final$Reviews, main="Histogram of Apps Reviews", xlab="Reviews (count)", col = 'blue', breaks = 100 )
ggplot(data_final, aes(x = log(Reviews))) +
geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
labs(title = "Log-Transformed Histogram of Reviews", x = "Log(Reviews)", y = "Count")
qqnorm(data_final$Reviews)
qqline(data_final$Reviews, col = "red")
As with the ratings, the plots are skewed by outliers. Hence, we can use the log of Reviews for visualisation, which compresses the long right tail. Because the raw values are skewed, they do not follow a normal distribution.
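The effect of the log transform on a right-skewed count can be checked numerically by comparing the mean with the median before and after transforming. The sketch below uses simulated heavy-tailed counts (hypothetical values, not the app data) and `log1p()` so that zero counts remain finite:

```r
set.seed(1)

# Simulated review counts with a long right tail
reviews <- round(rlnorm(1000, meanlog = 7, sdlog = 3))

# log1p(x) = log(1 + x) keeps zero counts finite
log_reviews <- log1p(reviews)

# A mean far above the median signals strong right skew
ratio_raw <- mean(reviews) / median(reviews)
ratio_log <- mean(log_reviews) / median(log_reviews)
print(c(raw = ratio_raw, log = ratio_log))
```

A ratio close to 1 after the transform indicates a far more symmetric distribution, which is why the log-scale histogram is easier to read.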
xkablesummary(data_final)
| | App | Category | Rating | Reviews | Size | Installs | Price | Content.Rating | Last.Updated | Android.Ver |
|---|---|---|---|---|---|---|---|---|---|---|
| Min | Length:9659 | FAMILY :1832 | Min. :1.000 | Min. : 0 | Min. : 0.0085 | Min. :0.000e+00 | Min. : 0.000 | Length:9659 | Min. :2010-05-21 | 4.1 and up :2202 |
| Q1 | Class :character | GAME : 959 | 1st Qu.:4.000 | 1st Qu.: 25 | 1st Qu.: 5.3000 | 1st Qu.:1.000e+03 | 1st Qu.: 0.000 | Class :character | 1st Qu.:2017-08-05 | 4.0.3 and up :1395 |
| Median | Mode :character | TOOLS : 827 | Median :4.200 | Median : 967 | Median : 13.1000 | Median :1.000e+05 | Median : 0.000 | Mode :character | Median :2018-05-04 | 4.0 and up :1285 |
| Mean | NA | BUSINESS : 420 | Mean :4.173 | Mean : 216593 | Mean : 20.1512 | Mean :7.778e+06 | Mean : 1.099 | NA | Mean :2017-10-30 | Varies with device: 990 |
| Q3 | NA | MEDICAL : 395 | 3rd Qu.:4.500 | 3rd Qu.: 29401 | 3rd Qu.: 27.0000 | 3rd Qu.:1.000e+06 | 3rd Qu.: 0.000 | NA | 3rd Qu.:2018-07-17 | 4.4 and up : 818 |
| Max | NA | PERSONALIZATION: 376 | Max. :5.000 | Max. :78158306 | Max. :100.0000 | Max. :1.000e+09 | Max. :400.000 | NA | Max. :2018-08-08 | 2.3 and up : 616 |
| NA | NA | (Other) :4850 | NA | NA | NA | NA | NA | NA | NA | (Other) :2353 |
outlierKD2(data_final, Reviews)
## Outliers identified: 1656
## Proportion (%) of outliers: 20.7
## Mean of the outliers: 1228141
## Mean without removing outliers: 216592.6
## Mean if we remove outliers: 7280.61
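The exact rule used by `outlierKD2()` is not shown here. A common choice, assumed in this sketch, is the boxplot rule (points more than 1.5 × IQR beyond the hinges), which base R exposes through `boxplot.stats()`. The values below are toy numbers, not the Reviews column:

```r
# Toy data with one extreme value (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() returns the points beyond 1.5 * IQR from the hinges in $out
outliers <- boxplot.stats(x)$out

mean_with <- mean(x)                      # mean including the outlier
mean_without <- mean(x[!x %in% outliers]) # mean after dropping it

cat("Outliers identified:", length(outliers), "\n")
cat("Mean without removing outliers:", mean_with, "\n")
cat("Mean if we remove outliers:", mean_without, "\n")
```

As with the Reviews column, a single very large value is enough to pull the mean far above the typical value.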
To see where the outliers lie, we split the review counts into bins and count the apps in each bin; this shows how the reviews are distributed.
The same bins are then used to check the average rating for each bin.
# Define custom breaks for the review bins
breaks <- c(0, 100, 500, 1000, 2500, 5000, 10000, 25000,50000,100000, 300000,1000000,Inf)
# Create a categorical variable based on the new breaks
Review_Category <- cut(data_final$Reviews, breaks = breaks, right = FALSE,
labels = c("0+","100+", "500+", "1K+",
"2.5K+", "5K+", "10K+","25K+",
"50K+", "100K+","300K+","1M+"))
# Count the number of values in each bin
bin_counts <- as.data.frame(table(Review_Category))
# Rename the columns for clarity
colnames(bin_counts) <- c("Review_Category", "Count")
# Print the counts
print(bin_counts)
## Review_Category Count
## 1 0+ 3327
## 2 100+ 1065
## 3 500+ 462
## 4 1K+ 586
## 5 2.5K+ 475
## 6 5K+ 474
## 7 10K+ 719
## 8 25K+ 606
## 9 50K+ 498
## 10 100K+ 647
## 11 300K+ 451
## 12 1M+ 349
# Create a line plot of the binned counts
ggplot(bin_counts, aes(x = Review_Category, y = Count, group = 1)) +
geom_line(color = "blue", size = 1) +
geom_point(color = "blue", size = 3) +
labs(title = "Number of Apps by Review Category",
x = "Review Category",
y = "Number of Apps") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
Hence, only a few apps have very many reviews, while most apps have few reviews, which is expected.
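The binning step relies on `cut()` with `right = FALSE`, which makes each interval closed on the left and open on the right. A small sketch with hypothetical review counts and a shortened set of breaks shows which bin a boundary value falls into:

```r
# Hypothetical review counts chosen to sit on and around the bin edges
reviews <- c(0, 99, 100, 499, 500, 2500000)

# Shortened breaks for illustration; the analysis above uses a longer vector
toy_breaks <- c(0, 100, 500, Inf)
toy_labels <- c("0+", "100+", "500+")

# right = FALSE gives the intervals [0, 100), [100, 500), [500, Inf)
review_bins <- cut(reviews, breaks = toy_breaks, right = FALSE, labels = toy_labels)

bin_table <- table(review_bins)
print(bin_table)
```

With `right = FALSE`, an app with exactly 100 reviews is labelled "100+", which matches the meaning of the labels.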
boxplot(data_final$Rating ~ Review_Category,
main = "Boxplot of Ratings by Review Category",
xlab = "Review Category",
ylab = "Rating",
las = 2, # Rotate the x-axis labels for readability
col = "lightblue") # Optional: Set color for the boxplots
Here we can observe that as the number of reviews increases, the median rating increases and the ratings cluster around higher values, which suggests that apps with more reviews tend to be better rated.
# Calculate the mean Rating for each Review_Category
mean_ratings <- tapply(data_final$Rating, Review_Category, mean, na.rm = TRUE)
# Convert the result to a data frame for better readability
mean_ratings_df <- data.frame(Review_Category = names(mean_ratings), Mean_Rating = as.numeric(mean_ratings))
# Print the mean ratings for each review bin
print(mean_ratings_df)
## Review_Category Mean_Rating
## 1 0+ 4.126221
## 2 100+ 4.029538
## 3 500+ 4.063188
## 4 1K+ 4.107030
## 5 2.5K+ 4.129572
## 6 5K+ 4.191139
## 7 10K+ 4.221836
## 8 25K+ 4.231848
## 9 50K+ 4.293775
## 10 100K+ 4.329830
## 11 300K+ 4.375610
## 12 1M+ 4.426361
# Define correct order of Review_Category as a factor
mean_ratings_df$Review_Category <- factor(mean_ratings_df$Review_Category,
levels = c("0+","100+", "500+", "1K+",
"2.5K+", "5K+", "10K+","25K+",
"50K+", "100K+", "300K+", "1M+"))
# Plot the mean ratings for each review bin in the correct order
ggplot(mean_ratings_df, aes(x = Review_Category, y = Mean_Rating)) +
geom_bar(stat = "identity", fill = "steelblue") + # Use bar plot
labs(title = "Mean Rating by Review Category",
x = "Review Category",
y = "Mean Rating") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
As we can see, apart from the lowest bin (0+, at 4.13), the mean rating rises steadily with the number of reviews, from 4.03 in the 100+ bin to 4.43 in the 1M+ bin.
# Create a new data frame for plotting
plot_data <- data.frame(Rating = data_final$Rating, Review_Category = Review_Category)
# Create a histogram of Ratings, faceted by Review_Category
ggplot(plot_data, aes(x = Rating)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
facet_wrap(~ Review_Category, labeller = label_wrap_gen()) + # Facet by Review_Category
theme_minimal() +
labs(title = "Histograms of Ratings by Review Category", x = "Rating", y = "Frequency")
This is another representation of ratings vs reviews.
# Scatter plot for Installs vs Reviews
ggplot(data_clean1_factor, aes(x = Reviews, y = Installs)) +
geom_point(color = "blue", alpha = 0.5) +
labs(title = "Scatter Plot of Installs vs Reviews",
x = "Number of Reviews",
y = "Number of Installs") +
theme_minimal()

# Scatter plot of log-transformed Installs vs. Rating
ggplot(data_final, aes(x = log(Installs), y = Rating)) + # Use columns of the plotted data frame only
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = FALSE) + # Add a regression line
labs(title = "Log-Transformed Installs vs. Rating",
x = "Log(Installs)",
y = "Rating") +
theme_minimal()

category_counts <- table(data_final$Category)
# Convert to data frame for plotting
category_counts_df <- as.data.frame(category_counts)
colnames(category_counts_df) <- c("Category", "Frequency")
ggplot(category_counts_df, aes(x = reorder(Category, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "#1f3374") +
geom_text(aes(label = Frequency), vjust = 0.5, hjust=1, size=2.5, color='#f8c220') +
coord_flip() +
labs(title = "Distribution of Categories", x = "Category", y = "Frequency") +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#efefef", color = NA),
panel.background = element_rect(fill = "#efefef", color = NA),
axis.text.y = element_text(size = 5.5)
)
As can be seen from the graph above, most of the apps in the dataset belong to the Family category, and Beauty has the fewest apps.
Below is a boxplot showing the distribution of the number of installs for each category, ordered by mean from highest to lowest.
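# Map the raw Installs column and let scale_y_log10() apply the log transform once.
# Categories are ordered by mean log-installs; log1p() avoids -Inf for apps with 0 installs.
# Apps with 0 installs are dropped by the log scale with a warning.
ggplot(data_clean, aes(x = reorder(Category, log1p(Installs), FUN = mean), y = Installs)) +
geom_boxplot(outlier.color = "#f05555", outlier.shape = 1, color='#1f3374', fill="#efefef") + # Red outliers for emphasis
coord_flip() + # Flip for better readability
scale_y_log10() +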
theme_minimal() +
labs(title = "Distribution of Installs by Category",
x = "Category",
y = "Number of Installs (Log Scale)") +
theme(
plot.background = element_rect(fill = "#efefef", color = NA),
panel.background = element_rect(fill = "#efefef", color = NA),
axis.text.y = element_text(size = 5.5)
)
It can be seen from the graph that, on average, Entertainment apps
receive the highest number of installations, followed by Education,
Game, Photography, and Weather apps. In contrast, Art & Design apps
have the fewest installations.
Below is the figure showing the distribution of app sizes in each category.
#df_clean <- data_clean %>%
# mutate(Size = sapply(Size, convert_size)) %>%
# filter(!is.na(Size))
# Plot the histogram with faceting by category
ggplot(data_clean, aes(x = Size)) +
geom_histogram(binwidth = 5, fill = "#304ba6", color = "black") +
facet_wrap(~ Category, scales = "free_y") +
theme_minimal() +
labs(
title = "Distribution of App Sizes by Category",
x = "Size (MB)",
y = "Count"
) +
theme(
strip.text = element_text(size = 5),
axis.text.x = element_text(size = 7, angle = 45, hjust = 1)
)
str(data_clean)
## 'data.frame': 9659 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : num 159 967 87510 215644 967 ...
## $ Size : num 19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
## $ Installs : num 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
ggplot(data_clean, aes(x = reorder(Category, Size, FUN = median), y = Size)) +
geom_boxplot(outlier.color = "#f05555", outlier.shape = 1) +
coord_flip() +
theme_minimal() +
labs(
title = "Boxplot of App Sizes by Category (Ordered by Median)",
x = "Category",
y = "Size (MB)"
) +
theme(
strip.text = element_text(size = 8),
axis.text.x = element_text(size = 7, angle = 45, hjust = 1)
)
As can be seen from the two figures above, most categories exhibit right-skewed app sizes, with the majority being under 50 MB. However, the Game category stands out with a noticeably larger median app size than the other categories.
Below is the graph displaying the distribution of reviews left by users for each category.
df_aggregated <- data_final %>%
group_by(Category) %>%
summarise(Total_Reviews = sum(Reviews, na.rm = TRUE))
#df_aggregated
# Plot the total reviews by category using a bar chart
ggplot(df_aggregated, aes(x = reorder(Category, -Total_Reviews), y = log10(Total_Reviews))) +
geom_bar(stat = "identity", fill = "#1f3374") +
labs(
title = "Log-Scaled Total Reviews by Category",
x = "Category",
y = "Log10(Total Number of Reviews)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
As can be seen, Game apps have the most reviews, while Events apps have the fewest.
Below is the figure showing the distribution of ratings for each category.
ggplot(data_final, aes(x = Rating)) +
geom_histogram(binwidth = 0.5, fill = "#1f3374", color='#efefef') +
facet_wrap(~ Category, scales = "free_y") + # Facet by Category with independent y-axis
scale_x_continuous(limits = c(1, 5), breaks = seq(1, 5, by = 0.5)) + # Restrict x-axis to 1-5
theme_minimal() +
labs(
title = "Distribution of Ratings by Category",
x = "Rating",
y = "Count"
) +
theme(
strip.text = element_text(size = 5), # Adjust facet label size
axis.text.x = element_text(size = 5, angle = 45, hjust = 1), # Rotate x-axis labels
plot.title = element_text(hjust = 0.5) # Center the plot title
)
As illustrated in the graph above, in every category most app ratings fall between 4.0 and 5.0.
Below is the figure showing the distribution of Android versions.
df_clean <- data_final %>%
filter(!is.na(Android.Ver) & !is.na(Reviews) & !(Android.Ver == 'NaN'))
extract_version <- function(version) {
version <- tolower(version) # Make lowercase for consistency
# Handle missing values, "Varies with device" and "NaN"
if (is.na(version) || version == "varies with device" || version == "nan") return(NA_real_)
# Keep only the leading major.minor number, so that
# "4.1 - 7.1.1" -> 4.1, "4.0 and up" -> 4.0 and "4.0.3 and up" -> 4.0
# (as.numeric("4.0.3") would otherwise return NA and drop those apps)
first_version <- sub("^([0-9]+(\\.[0-9]+)?).*$", "\\1", version)
return(suppressWarnings(as.numeric(first_version))) # NA if there is no leading number
}
df_clean <- data_final %>%
mutate(Android_Ver = sapply(Android.Ver, extract_version)) %>%
filter(!is.na(Android_Ver)) # Remove rows with NA in Android_Ver
android_installs <- data_final %>%
group_by(Android.Ver) %>%
summarize(Total_Installs = sum(Installs, na.rm = TRUE))
ggplot(df_clean, aes(x = Android_Ver)) +
geom_histogram(binwidth = 0.5, fill = "#1f3374", color='#efefef') +
scale_x_continuous(breaks = seq(1, 8, by = 1.0)) + # Set x-axis ticks from 1.0 to 8.0
theme_minimal() +
labs(
title = "Distribution of Android Versions",
x = "Android Version",
y = "Count"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
As can be seen, the minimum required Android version for most apps is 4.0 and up.
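Version strings with three components, such as "4.0.3 and up", cannot be converted with `as.numeric()` directly. The following is a sketch of a vectorised parser that keeps only the leading major.minor number; `parse_min_android()` is a hypothetical helper introduced here for illustration, not part of the analysis code:

```r
# Hypothetical helper: extract the leading "major" or "major.minor" number
parse_min_android <- function(x) {
  x <- tolower(as.character(x))
  out <- rep(NA_real_, length(x))
  # Only strings that start with a number can be parsed
  has_num <- !is.na(x) & grepl("^[0-9]+(\\.[0-9]+)?", x)
  out[has_num] <- as.numeric(sub("^([0-9]+(\\.[0-9]+)?).*$", "\\1", x[has_num]))
  out
}

versions <- c("4.0.3 and up", "4.1 - 7.1.1", "Varies with device", "NaN", NA)
parsed <- parse_min_android(versions)
print(data.frame(versions, parsed))
```

Strings without a leading number, including "Varies with device", come back as `NA` and can then be filtered out before plotting.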
Below is the graph showing the number of installs for each minimum required Android Version.
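# Use the per-version totals in android_installs so that bars are ordered by total installs
# and "Varies with device" (filtered out of df_clean) is included
ggplot(android_installs, aes(x = reorder(Android.Ver, Total_Installs), y = Total_Installs)) +
geom_col(fill = "#1f3374") +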
coord_flip() + # Flip coordinates for better readability
scale_y_continuous(labels = scales::comma) + # Format y-axis with commas
theme_minimal() +
labs(
title = "Total Installs by Android Version",
x = "Android Version",
y = "Total Installs"
) +
theme(
axis.text.y = element_text(size = 8), # Adjust y-axis text size
plot.title = element_text(hjust = 0.5) # Center the plot title
)
It can be seen that the number of installs is highest for apps whose required Android version varies with the device.
Below is the distribution of reviews for each minimum required Android Version.
df_clean <- data_final %>%
filter(!is.na(Android.Ver) & !is.na(Reviews) & !(Android.Ver == 'NaN')) %>%
mutate(Scaled_Reviews = log10(Reviews + 1))
ggplot(df_clean, aes(x = reorder(Android.Ver, Scaled_Reviews, FUN = median), y = Scaled_Reviews)) +
geom_boxplot(outlier.color = "#f05555", outlier.shape = 1) + # Boxplot with red outliers
coord_flip() + # Flip coordinates for better readability
theme_minimal() +
labs(
title = "Distribution of Scaled Reviews by Android Version",
x = "Android Version",
y = "Scaled Reviews (Log10)"
) +
theme(
axis.text.y = element_text(size = 8), # Adjust y-axis text size
plot.title = element_text(hjust = 0.5) # Center the plot title
)
It can be seen that apps requiring versions 4.1 to 7.1.1 have the most reviews, while apps requiring versions 5.0 to 7.1.1 have the fewest.
Below is the plot showing the distribution of ratings for each Android version.
ggplot(df_clean, aes(x = Rating, fill = Android.Ver)) +
geom_histogram(binwidth = 0.5, position = "stack", color = "black", alpha = 0.7) +
scale_x_continuous(breaks = seq(1, 5, by = 0.5)) + # Set x-axis breaks
theme_minimal() +
labs(
title = "Histogram of Ratings by Android Version",
x = "Rating",
y = "Count"
) +
theme(
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
plot.title = element_text(hjust = 0.5) # Center the plot title
)
It can be seen that, for most Android versions, ratings range between 4.0 and 5.0.
# Clean and prepare the Last Updated and Content column
data_updated <- data_final %>%
mutate(
Content.Rating = as.factor(Content.Rating)
)
# 1. Content Rating Distribution
content_rating_dist <- table(data_updated$Content.Rating)
print("Content Rating Distribution:")
## [1] "Content Rating Distribution:"
print(content_rating_dist)
##
## adults only 18+ everyone everyone 10+ mature 17+ teen
## 3 7903 322 393 1036
## unrated
## 2
# Bar plot for Content Rating
ggplot(data_final, aes(x = Content.Rating)) +
geom_bar(fill = "skyblue") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) + # after_stat() replaces the deprecated ..count.. (ggplot2 >= 3.3.0)
labs(title = "Distribution of App Content Ratings",
x = "Content Rating",
y = "Number of Apps") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Everyone is the dominant content rating, covering 81.82% of all apps (7,903 of 9,659). The smallest groups are Adults only 18+ (3 apps, about 0.03%) and Unrated (2 apps, about 0.02%).
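The percentages quoted above can be recomputed from the printed counts. A short sketch that re-enters the counts from the table output by hand (rather than reading `data_updated`) and converts them to shares with `prop.table()`:

```r
# Counts copied from the Content Rating Distribution output above
content_counts <- c("adults only 18+" = 3, "everyone" = 7903, "everyone 10+" = 322,
                    "mature 17+" = 393, "teen" = 1036, "unrated" = 2)

# Share of each content rating as a percentage of all apps
shares <- round(100 * prop.table(content_counts), 2)

print(sort(shares, decreasing = TRUE))
```

In the analysis itself the same result would come from applying `prop.table()` to `content_rating_dist`.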
# Last Updated Analysis
# Create summary of updates by month and year
updates_by_month <- data_updated %>%
mutate(
update_month = format(Last.Updated, "%Y-%m"),
update_year = year(Last.Updated)
) %>%
group_by(update_month) %>%
summarize(count = n()) %>%
arrange(update_month)
# Plot updates over time
#ggplot(updates_by_month, aes(x = as.Date(paste0(update_month, "-01")), y = count)) +
#geom_line(color = "blue") +
#geom_point(color = "red") +
#labs(title = "Number of App Updates Over Time",
# x = "Date",
# y = "Number of Updates") +
#theme_minimal() +
# theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Content Rating and Update Frequency Relationship
update_frequency_by_rating <- data_updated %>%
group_by(Content.Rating) %>%
summarize(
avg_last_update = mean(Last.Updated),
median_last_update = median(Last.Updated),
n_apps = n()
)
print("\nUpdate Frequency by Content Rating:")
## [1] "\nUpdate Frequency by Content Rating:"
print(update_frequency_by_rating)
## # A tibble: 6 × 4
## Content.Rating avg_last_update median_last_update n_apps
## <fct> <date> <date> <int>
## 1 adults only 18+ 2018-07-20 2018-07-24 3
## 2 everyone 2017-10-20 2018-04-20 7903
## 3 everyone 10+ 2017-11-24 2018-06-06 322
## 4 mature 17+ 2018-02-18 2018-07-09 393
## 5 teen 2017-12-03 2018-06-05 1036
## 6 unrated 2013-10-25 2013-10-25 2
# Basic statistics for Installs by Content Rating
installs_by_rating <- data_updated %>%
group_by(Content.Rating) %>%
summarise(
mean_installs = mean(Installs, na.rm = TRUE),
median_installs = median(Installs, na.rm = TRUE),
total_installs = sum(Installs, na.rm = TRUE),
n_apps = n()
) %>%
arrange(desc(mean_installs))
print("Summary of Installs by Content Rating:")
## [1] "Summary of Installs by Content Rating:"
print(installs_by_rating)
## # A tibble: 6 × 5
## Content.Rating mean_installs median_installs total_installs n_apps
## <fct> <dbl> <dbl> <dbl> <int>
## 1 teen 15914358. 500000 16487275393 1036
## 2 everyone 10+ 12472894. 1000000 4016271795 322
## 3 everyone 6602474. 50000 52179352961 7903
## 4 mature 17+ 6203529. 500000 2437986878 393
## 5 adults only 18+ 666667. 500000 2000000 3
## 6 unrated 25250 25250 50500 2
# Basic statistics for Ratings by Content Rating
ratings_by_content <- data_updated %>%
group_by(Content.Rating) %>%
summarise(
mean_rating = mean(Rating, na.rm = TRUE),
median_rating = median(Rating, na.rm = TRUE),
total_ratings = sum(Rating, na.rm = TRUE),
n_apps = n()
) %>%
arrange(desc(mean_rating))
print("Summary of Ratings by Content Rating:")
## [1] "Summary of Ratings by Content Rating:"
print(ratings_by_content)
## # A tibble: 6 × 5
## Content.Rating mean_rating median_rating total_ratings n_apps
## <fct> <dbl> <dbl> <dbl> <int>
## 1 adults only 18+ 4.3 4.5 12.9 3
## 2 everyone 10+ 4.22 4.3 1360. 322
## 3 teen 4.22 4.2 4371. 1036
## 4 everyone 4.17 4.2 32935. 7903
## 5 unrated 4.14 4.14 8.27 2
## 6 mature 17+ 4.13 4.2 1622. 393
# Basic statistics for Reviews by Content Rating
reviews_by_content <- data_updated %>%
group_by(Content.Rating) %>%
summarise(
mean_reviews = mean(Reviews, na.rm = TRUE),
median_reviews = median(Reviews, na.rm = TRUE),
total_reviews = sum(Reviews, na.rm = TRUE),
n_apps = n()
) %>%
arrange(desc(mean_reviews))
print("Summary of Reviews by Content Rating:")
## [1] "Summary of Reviews by Content Rating:"
print(reviews_by_content)
## # A tibble: 6 × 5
## Content.Rating mean_reviews median_reviews total_reviews n_apps
## <fct> <dbl> <dbl> <dbl> <int>
## 1 everyone 10+ 625243. 19023 201328121 322
## 2 teen 485803. 10144 503292211 1036
## 3 mature 17+ 221471. 3414 87038201 393
## 4 everyone 164536. 573 1300326506 7903
## 5 adults only 18+ 27116 24005 81348 3
## 6 unrated 594. 594. 1187 2
# Create days_since_update and data preparation
data_updated <- data_final %>%
mutate(
# Convert Last.Updated to proper date format (assuming it's in standard format)
last_updated = as.Date(Last.Updated),
current_date = Sys.Date(),
# Calculate days since last update
days_since_update = as.numeric(difftime(current_date, last_updated, units = "days")),
# Extract month from last_updated date
update_month = month(last_updated)
) %>%
# Remove any invalid dates or NA values
filter(!is.na(last_updated), !is.na(days_since_update))
# Create subset for update analysis
data_updated <- data_updated %>% filter(!is.na(days_since_update))
# Calculate update statistics by Content Rating
update_patterns <- data_updated %>%
group_by(Content.Rating) %>%
summarize(
avg_days_since_update = mean(days_since_update, na.rm = TRUE),
median_days_since_update = median(days_since_update, na.rm = TRUE),
sd_days_since_update = sd(days_since_update, na.rm = TRUE),
n_apps = n(),
cv = sd_days_since_update / avg_days_since_update * 100 # Coefficient of Variation
) %>%
arrange(avg_days_since_update)
print("\nUpdate Patterns by Content Rating:")
## [1] "\nUpdate Patterns by Content Rating:"
print(update_patterns)
## # A tibble: 6 × 6
## Content.Rating avg_days_since_update median_days_since_update
## <chr> <dbl> <dbl>
## 1 adults only 18+ 2331. 2328
## 2 mature 17+ 2484. 2343
## 3 teen 2561. 2377
## 4 everyone 10+ 2570. 2376
## 5 everyone 2605. 2423
## 6 unrated 4060. 4060.
## # ℹ 3 more variables: sd_days_since_update <dbl>, n_apps <int>, cv <dbl>
# Create monthly update counts
update_heatmap_data <- data_updated %>%
group_by(update_month, Content.Rating) %>%
summarize(count = n(), .groups = 'drop') %>%
# Ensure all months and ratings are included, even if count is 0
complete(
update_month = 1:12,
Content.Rating = unique(data_updated$Content.Rating),
fill = list(count = 0)
) %>%
# Reshape data for heatmap
pivot_wider(
names_from = Content.Rating,
values_from = count
)
# Convert to matrix for traditional heatmap
update_matrix <- as.matrix(update_heatmap_data[,-1])
rownames(update_matrix) <- month.abb[update_heatmap_data$update_month]
# Create enhanced heatmap using ggplot2
heatmap_data_long <- melt(update_matrix)
colnames(heatmap_data_long) <- c("Month", "Content_Rating", "Count")
heatmap_data_long$Month <- factor(heatmap_data_long$Month, levels = month.abb)
# Create the heatmap visualization
ggplot(heatmap_data_long, aes(x = Content_Rating, y = Month, fill = Count)) +
geom_tile(color = "white") + # Add white borders between tiles
scale_fill_gradient(
low = "white",
high = "steelblue",
name = "Number of Updates"
) +
theme_minimal() +
labs(
title = "App Update Patterns by Content Rating",
x = "Content Rating",
y = "Month",
subtitle = paste("Data as of", format(Sys.Date(), "%B %d, %Y"))
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
panel.grid = element_blank(),
panel.border = element_rect(fill = NA, color = "grey80"),
legend.position = "right"
)
# Calculate update velocity
update_velocity <- data_updated %>%
group_by(Content.Rating) %>%
summarize(
update_velocity = n() / n_distinct(update_month),
total_apps = n(),
avg_days_between_updates = mean(days_since_update, na.rm = TRUE) # NOTE: mean days since the last update, not an interval between updates
) %>%
arrange(desc(update_velocity))
print("\nUpdate Velocity by Content Rating:")
## [1] "\nUpdate Velocity by Content Rating:"
print(update_velocity)
## # A tibble: 6 × 4
## Content.Rating update_velocity total_apps avg_days_between_updates
## <chr> <dbl> <int> <dbl>
## 1 everyone 659. 7903 2605.
## 2 teen 86.3 1036 2561.
## 3 mature 17+ 32.8 393 2484.
## 4 everyone 10+ 26.8 322 2570.
## 5 adults only 18+ 1.5 3 2331.
## 6 unrated 1 2 4060.
# Optional: Additional summary statistics for days since update
summary_stats <- data_updated %>%
summarize(
mean_days = mean(days_since_update, na.rm = TRUE),
median_days = median(days_since_update, na.rm = TRUE),
min_days = min(days_since_update, na.rm = TRUE),
max_days = max(days_since_update, na.rm = TRUE),
q1_days = quantile(days_since_update, 0.25, na.rm = TRUE),
q3_days = quantile(days_since_update, 0.75, na.rm = TRUE)
)
print("\nOverall Summary Statistics for Days Since Update:")
## [1] "\nOverall Summary Statistics for Days Since Update:"
print(summary_stats)
## mean_days median_days min_days max_days q1_days q3_days
## 1 2594.185 2409 2313 5314 2335 2680.5
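The day counts above depend on the date the code was run, since they are measured from `Sys.Date()`. One way to make them reproducible, sketched here with three made-up dates and the assumption that the latest update date in the data is an acceptable reference point, is to measure from a fixed reference date instead:

```r
# Hypothetical last-update dates (not rows from the dataset)
last_updated <- as.Date(c("2018-08-08", "2018-07-17", "2013-10-25"))

# Fixed reference: the most recent update date observed in the data
reference_date <- max(last_updated, na.rm = TRUE)

# Days between each app's last update and the reference date
days_since_update <- as.numeric(difftime(reference_date, last_updated, units = "days"))

print(data.frame(last_updated, days_since_update))
```

The ordering of apps and groups is unchanged by the choice of reference date; only the absolute day counts shift.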
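The update_velocity column in the Update Velocity table above is the number of apps in each content rating category divided by the number of distinct calendar months in which those apps were last updated (for example, 7903 / 12 ≈ 659 for “everyone”). Because the dataset records only the most recent update date of each app, this value mainly reflects the size of each category rather than how frequently individual apps receive updates.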
# # 1. Update Cycle Analysis
# data_updated <- data_updated %>%
# mutate(
# Last.Updated = as.Date(Last.Updated, format = "%B %d, %Y"),
# day_of_week = wday(Last.Updated, label = TRUE),
# week_of_year = week(Last.Updated),
# month_of_year = month(Last.Updated, label = TRUE),
# season = case_when(
# month_of_year %in% c("Dec", "Jan", "Feb") ~ "Winter",
# month_of_year %in% c("Mar", "Apr", "May") ~ "Spring",
# month_of_year %in% c("Jun", "Jul", "Aug") ~ "Summer",
# TRUE ~ "Fall"
# )
# )
#
# # Day of Week Update Pattern by Content Rating
# dow_pattern <- data_updated %>%
# group_by(Content.Rating, day_of_week) %>%
# summarise(count = n()) %>%
# group_by(Content.Rating) %>%
# mutate(percentage = count/sum(count) * 100)
#
# ggplot(dow_pattern, aes(x = day_of_week, y = percentage, fill = Content.Rating)) +
# geom_bar(stat = "identity", position = "dodge") +
# facet_wrap(~Content.Rating) +
# labs(title = "Update Day Preferences by Content Rating",
# x = "Day of Week",
# y = "Percentage of Updates") +
# theme_minimal() +
# theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Update Interval Analysis
update_intervals <- data_updated %>%
group_by(Content.Rating) %>%
arrange(Last.Updated) %>%
mutate(days_between_updates = as.numeric(Last.Updated - lag(Last.Updated))) %>%
summarise(
mean_interval = mean(days_between_updates, na.rm = TRUE),
median_interval = median(days_between_updates, na.rm = TRUE),
std_dev = sd(days_between_updates, na.rm = TRUE),
cv = std_dev / mean_interval * 100 # Coefficient of Variation
)
print("Update Interval Analysis:")
## [1] "Update Interval Analysis:"
print(update_intervals)
## # A tibble: 6 × 5
## Content.Rating mean_interval median_interval std_dev cv
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 adults only 18+ 15 15 7.07 47.1
## 2 everyone 0.380 0 3.53 929.
## 3 everyone 10+ 8.33 1 46.5 557.
## 4 mature 17+ 5.48 0 21.5 392.
## 5 teen 2.36 0 14.7 622.
## 6 unrated 1213 1213 NA NA
# Create data_updated with seasonal information while keeping data_final unchanged
data_updated <- data_final %>%
mutate(
last_updated = as.Date(Last.Updated),
current_date = Sys.Date(),
days_since_update = as.numeric(difftime(current_date, last_updated, units = "days")),
update_month = month(last_updated),
season = case_when(
update_month %in% c(12, 1, 2) ~ "Winter",
update_month %in% c(3, 4, 5) ~ "Spring",
update_month %in% c(6, 7, 8) ~ "Summer",
update_month %in% c(9, 10, 11) ~ "Fall"
)
) %>%
filter(!is.na(last_updated), !is.na(days_since_update))
# Calculate seasonal update intensity
seasonal_intensity <- data_updated %>%
group_by(Content.Rating, season) %>%
summarise(
update_count = n(),
update_intensity = n() / n_distinct(last_updated),
avg_days_between_updates = mean(days_since_update, na.rm = TRUE), # NOTE: mean days since the last update, not an interval between updates
.groups = 'drop'
) %>%
mutate(season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))) %>%
arrange(Content.Rating, desc(update_intensity))
# Create enhanced seasonal bar plot
seasonal_plot <- ggplot(seasonal_intensity,
aes(x = season, y = update_intensity, fill = Content.Rating)) +
geom_bar(stat = "identity", position = "dodge", width = 0.8) +
scale_fill_brewer(palette = "Set3") +
labs(
title = "Seasonal Update Intensity by Content Rating",
subtitle = paste("Analysis Period:", format(min(data_updated$last_updated), "%B %Y"),
"to", format(max(data_updated$last_updated), "%B %Y")),
x = "Season",
y = "Update Intensity",
fill = "Content Rating"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
plot.subtitle = element_text(hjust = 0.5, size = 10),
axis.text.x = element_text(angle = 0),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "right"
)
# Create seasonal heatmap
seasonal_heatmap <- ggplot(seasonal_intensity,
aes(x = season, y = Content.Rating, fill = update_intensity)) +
geom_tile(color = "white") +
scale_fill_gradient2(
low = "white",
high = "steelblue",
name = "Update\nIntensity"
) +
labs(
title = "Seasonal Update Patterns Heatmap",
x = "Season",
y = "Content Rating"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 0),
panel.grid = element_blank(),
legend.position = "right"
)
# Print both plots side by side
library(gridExtra)
grid.arrange(seasonal_plot, seasonal_heatmap, ncol = 2)
# Print seasonal statistics
print("\nSeasonal Update Intensity Statistics:")
## [1] "\nSeasonal Update Intensity Statistics:"
print(seasonal_intensity)
## # A tibble: 19 × 5
## Content.Rating season update_count update_intensity avg_days_between_updates
## <chr> <fct> <int> <dbl> <dbl>
## 1 adults only 18+ Summer 3 1 2331.
## 2 everyone Summer 3992 11.1 2470.
## 3 everyone Spring 1826 5.42 2630.
## 4 everyone Winter 1202 3.70 2791.
## 5 everyone Fall 883 3.06 2910.
## 6 everyone 10+ Summer 190 2.5 2433.
## 7 everyone 10+ Spring 54 1.23 2664.
## 8 everyone 10+ Fall 32 1.19 2914
## 9 everyone 10+ Winter 46 1.18 2782.
## 10 mature 17+ Summer 269 3.90 2372.
## 11 mature 17+ Spring 59 1.23 2661.
## 12 mature 17+ Winter 44 1.1 2714.
## 13 mature 17+ Fall 21 1.05 2935.
## 14 teen Summer 612 4.67 2439.
## 15 teen Spring 203 1.69 2619.
## 16 teen Winter 109 1.35 2780.
## 17 teen Fall 112 1.26 2903.
## 18 unrated Summer 1 1 3454
## 19 unrated Winter 1 1 4667
# Additional seasonal summary
seasonal_summary <- data_updated %>%
group_by(season) %>%
summarise(
total_updates = n(),
avg_days_since_update = mean(days_since_update, na.rm = TRUE),
median_days_since_update = median(days_since_update, na.rm = TRUE),
n_apps = n_distinct(Content.Rating), # NOTE: counts distinct content ratings in the season, not apps
.groups = 'drop'
) %>%
arrange(match(season, c("Winter", "Spring", "Summer", "Fall")))
print("\nOverall Seasonal Summary:")
## [1] "\nOverall Seasonal Summary:"
print(seasonal_summary)
## # A tibble: 4 × 5
## season total_updates avg_days_since_update median_days_since_update n_apps
## <chr> <int> <dbl> <dbl> <int>
## 1 Winter 1402 2788. 2546 5
## 2 Spring 2142 2631. 2441 4
## 3 Summer 5067 2460. 2337 6
## 4 Fall 1048 2910. 2642. 4
The visualization shows the seasonal update intensity for each content rating across the four seasons. Here, “Update Intensity” is the number of apps last updated in a season divided by the number of distinct last-update dates in that season. Apps rated “everyone” show the highest intensity in every season, peaking in Summer (11.1, versus 3.06 in Fall). The “teen” and “mature 17+” ratings show lower intensities that also peak in Summer (4.67 and 3.90) and stay between roughly 1.0 and 1.7 in the other seasons. This pattern suggests that applications rated for general audiences are updated most intensively, especially during the Summer, potentially to meet increased demand or prepare for seasonal trends. However, because the last-update dates end on 2018-08-08 and about a quarter of the apps were last updated on or after 2018-07-17, the Summer peak may partly reflect when the data were collected.
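To make the intensity measure concrete, the following sketch reproduces the `n() / n_distinct(last_updated)` calculation in base R on a toy data frame. The column names and values are hypothetical and chosen only to keep the arithmetic easy to follow:

```r
# Toy data: six apps, two content ratings, two distinct update dates
toy <- data.frame(
  rating = c("everyone", "everyone", "everyone", "everyone", "teen", "teen"),
  last_updated = as.Date(c("2018-07-01", "2018-07-01", "2018-07-01", "2018-07-02",
                           "2018-07-01", "2018-07-02"))
)

# Group the update dates by content rating
by_rating <- split(toy$last_updated, toy$rating)

n_apps <- sapply(by_rating, length)                           # apps per rating
n_dates <- sapply(by_rating, function(d) length(unique(d)))   # distinct dates per rating

# Intensity = apps per distinct update date
update_intensity <- n_apps / n_dates
print(update_intensity)
```

A larger group updated on the same set of dates gets a higher intensity, which is why the measure tracks group size closely.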
installs_by_rating <- data_updated %>%
group_by(Content.Rating) %>%
summarise(
mean_installs = mean(Installs, na.rm = TRUE),
median_installs = median(Installs, na.rm = TRUE),
total_installs = sum(Installs, na.rm = TRUE),
n_apps = n()
) %>%
arrange(desc(mean_installs))
# Visualize distribution of installs by content rating
ggplot(data_updated, aes(x = Content.Rating, y = log10(Installs))) +
geom_boxplot(fill = "lightblue") +
labs(title = "Distribution of App Installs by Content Rating",
x = "Content Rating",
y = "Log10(Number of Installs)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
data_analysis <- data_updated %>%
mutate(
days_since_update = as.numeric(difftime(max(Last.Updated), Last.Updated, units = "days")),
update_year = year(Last.Updated),
update_month = month(Last.Updated)
)
data_analysis <- data_analysis %>%
mutate(update_recency = ifelse(days_since_update <= median(days_since_update),
"Recent Update", "Old Update"))
recent_vs_old <- data_analysis %>%
group_by(Content.Rating, update_recency) %>%
summarise(
mean_installs = mean(Installs, na.rm = TRUE),
median_installs = median(Installs, na.rm = TRUE),
n_apps = n()
)
print("\nComparison of Installs by Update Recency and Content Rating:")
## [1] "\nComparison of Installs by Update Recency and Content Rating:"
print(recent_vs_old)
## # A tibble: 10 × 5
## # Groups: Content.Rating [6]
## Content.Rating update_recency mean_installs median_installs n_apps
## <chr> <chr> <dbl> <dbl> <int>
## 1 adults only 18+ Recent Update 666667. 500000 3
## 2 everyone Old Update 1787608. 10000 4110
## 3 everyone Recent Update 11819742. 500000 3793
## 4 everyone 10+ Old Update 2711120. 100000 135
## 5 everyone 10+ Recent Update 19520163. 1000000 187
## 6 mature 17+ Old Update 875646. 100000 118
## 7 mature 17+ Recent Update 8489675. 500000 275
## 8 teen Old Update 1625562. 50000 441
## 9 teen Recent Update 26504878. 1000000 595
## 10 unrated Old Update 25250 25250 2
# 7. Visualization of update recency effect
ggplot(data_analysis, aes(x = Content.Rating, y = log10(Installs), fill = update_recency)) +
geom_boxplot() +
labs(title = "Install Distribution by Content Rating and Update Recency",
x = "Content Rating",
y = "Log10(Number of Installs)",
fill = "Update Recency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The boxplot shows the distribution of app installs across content
ratings, split by update recency (old vs. recent, relative to the median
number of days since the last update). Apps with recent updates have
higher median installs than those with older updates in every rating
where both groups exist, indicating that recently updated apps tend to
have more installs. The gap is widest for “everyone” (median 10,000 vs.
500,000) and “teen” (50,000 vs. 1,000,000). For “everyone 10+” and
“mature 17+,” the difference is smaller in relative terms (100,000 vs.
1,000,000 and 100,000 vs. 500,000), suggesting a weaker effect of update
recency in these categories. The “adults only 18+” and “unrated”
categories contain only 3 and 2 apps respectively, so no reliable
conclusion can be drawn for them.
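The boxplot comparison can be backed by a formal test. A minimal sketch, assuming a Wilcoxon rank-sum test is an acceptable choice for heavily skewed install counts; the `toy` data frame and its values are synthetic stand-ins for `data_analysis`, not the real data:

```r
# Synthetic stand-in for data_analysis; all values are made up for illustration.
set.seed(1)
n <- 400
toy <- data.frame(
  update_recency = rep(c("Recent Update", "Old Update"), each = n / 2),
  # Install counts drawn on a log10 scale; the recent group is shifted upward.
  Installs = round(10^c(rnorm(n / 2, mean = 5.5, sd = 1.5),
                        rnorm(n / 2, mean = 4.5, sd = 1.5)))
)
toy <- toy[toy$Installs > 0, ]

# Median installs per group, mirroring the summary table above.
group_medians <- tapply(toy$Installs, toy$update_recency, median)

# Wilcoxon rank-sum test: compares the two groups without assuming normality.
wilcox_result <- wilcox.test(Installs ~ update_recency, data = toy)

print(group_medians)
print(wilcox_result$p.value)
```

A rank-based test is used here because install counts span several orders of magnitude; a t-test on log10(Installs) would be a reasonable alternative.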
# 3. Timeline analysis: Average installs over time by content rating
installs_timeline <- data_updated %>%
group_by(Content.Rating, Last.Updated) %>%
summarise(avg_installs = mean(Installs, na.rm = TRUE)) %>%
ungroup()
ggplot(installs_timeline, aes(x = Last.Updated, y = log10(avg_installs), color = Content.Rating)) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Average App Installs Over Time by Content Rating",
x = "Last Updated Date",
y = "Log10(Average Installs)") +
theme_minimal() +
theme(legend.position = "bottom")
The line graph plots average app installs (y-axis, log10 scale) against
the last-updated date for each content rating. Because the x-axis is the
date of an app's most recent update rather than the date of its installs,
the curves show how install counts vary with update recency, not how
installs grew over calendar time. Apps with broader content ratings like
“everyone” and “everyone 10+” show a marked rise in average installs
among apps last updated from 2016 onwards, reaching a peak around 2018.
“Mature 17+” apps follow a similar trend but start with higher average
installs and dip around 2012 before recovering alongside the other
categories.
The “teen” content rating stays relatively steady before rising sharply from 2016 onwards. In contrast, “adults only 18+” shows only a limited increase, and that curve is based on very few apps. All content ratings converge toward higher install averages near 2018, which is consistent with the earlier finding that recently updated apps tend to have more installs.
Let's convert all the categorical variables to factors and then to numeric codes, producing a numeric data frame for calculating the correlation matrix.
# Step 1: Create a copy of the original data without specific columns
columns_to_remove <- c("App", "Scaled_Reviews", "update_year", "update_month",
"update_quarter", "days_since_update", "week_of_year",
"Last.Updated", "day_of_week", "month_of_year", "season")
data_numeric_or_factor <- data_updated %>%
select(-any_of(columns_to_remove)) # Changed to any_of to handle missing columns gracefully
# Step 2: Identify and convert character columns to factors
data_numeric_or_factor <- data_numeric_or_factor %>%
mutate(across(where(is.character), as.factor))
# Step 3: Create a copy for factor data
data_factor <- data_numeric_or_factor
# Step 4: Identify numeric and factor columns
numeric_columns <- sapply(data_numeric_or_factor, is.numeric)
factor_columns <- sapply(data_numeric_or_factor, is.factor)
# Step 5: Convert factors to numeric while preserving numeric columns
data_final_numeric <- data_numeric_or_factor %>%
mutate(across(where(is.factor), ~as.numeric(as.factor(.))))
# Step 6: Check for any non-numeric columns and remove them
non_numeric_cols <- names(data_final_numeric)[!sapply(data_final_numeric, is.numeric)]
if(length(non_numeric_cols) > 0) {
data_final_numeric <- data_final_numeric %>%
select(-all_of(non_numeric_cols))
}
# Step 7: Calculate correlations
# Pearson correlation
pearson_correlation <- cor(data_final_numeric,
method = "pearson",
use = "complete.obs")
# Spearman correlation
spearman_correlation <- cor(data_final_numeric,
method = "spearman",
use = "complete.obs")
# Step 8: Create enhanced correlation plots
# Pearson correlation plot
corrplot(pearson_correlation,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
number.cex = 0.7,
title = "Pearson Correlation Matrix",
mar = c(0, 0, 1, 0)) # Top margin so the title is not clipped
# Spearman correlation plot
corrplot(spearman_correlation,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
number.cex = 0.7,
title = "Spearman Correlation Matrix",
mar = c(0, 0, 1, 0)) # Top margin so the title is not clipped
From the correlation matrices above:
The Pearson and Spearman matrices differ noticeably. For the categorical variables we refer to the Spearman matrix, with the caveat that these variables are integer-coded from factor levels, so their coefficients depend on the (alphabetical) level order.
Reviews has the highest positive correlation with Installs. In the Spearman matrix, Reviews is also positively correlated with content rating and Android version.
Rating is not strongly correlated with any variable; it is only slightly positively correlated with Reviews and Installs, as the earlier visualisations also showed.
Price vs. Log_Installs: -0.06, a very weak negative relationship between price and the number of installs.
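Why integer-coded categorical variables need care in a correlation matrix can be illustrated with a small sketch. The category names, effects, and data below are invented for illustration; the point is only that `as.numeric(as.factor(.))` assigns codes by alphabetical level order, so the coefficient changes when that order changes:

```r
set.seed(2)
# Hypothetical nominal variable with no natural order.
category <- sample(c("TOOLS", "GAME", "FAMILY", "MEDICAL"), 300, replace = TRUE)
# Made-up outcome whose mean depends on the category.
effect <- c(TOOLS = 0, GAME = 3, FAMILY = 1, MEDICAL = -2)
y <- unname(effect[category]) + rnorm(300)

# Default coding: levels sorted alphabetically (FAMILY=1, GAME=2, MEDICAL=3, TOOLS=4).
codes_alpha <- as.numeric(factor(category))
# Same data with a different, equally arbitrary, level order.
codes_other <- as.numeric(factor(category,
                                 levels = c("MEDICAL", "TOOLS", "FAMILY", "GAME")))

rho_alpha <- cor(codes_alpha, y, method = "spearman")
rho_other <- cor(codes_other, y, method = "spearman")
print(c(alphabetical = rho_alpha, reordered = rho_other))
```

The two coefficients differ in sign even though the underlying data are identical, which is why correlations involving nominal columns such as Category are best read as rough indications only.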
# Create a new data frame with relevant variables for correlation analysis
correlation_data <- data_analysis %>%
mutate(log_installs = log10(Installs)) %>% # use Installs from the same data frame so rows stay aligned
select(days_since_update, update_year, update_month, log_installs)
# Calculate the correlation matrix
correlation_matrix <- cor(correlation_data, method = "spearman", use = "complete.obs")
# Print the correlation matrix
print("Spearman Correlation Matrix:")
## [1] "Spearman Correlation Matrix:"
corrplot(correlation_matrix, method = "color",
col = colorRampPalette(c("red", "white", "blue"))(200),
type = "upper",
tl.col = "black", tl.srt = 45,
addCoef.col = "black", # Add correlation coefficients
number.cex = 0.7, # Adjust size of numbers
title = "Correlation Matrix", # Title
mar = c(0, 0, 1, 0)) # Margins
Correlation Analysis: A moderate negative correlation (ρ = −0.3317) was found between the number of days since the last update and the log-transformed installs: as the time since the last update increases, the number of installs tends to decrease. The relationship is statistically significant (p < 2.2e-16). This is consistent with timely updates helping to maintain user engagement, although a correlation alone does not establish causation.
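The code that produced this p-value is not shown in this section. One way such a test could be run is sketched below with `cor.test`; the simulated vectors are assumptions standing in for `days_since_update` and `log_installs`, so the numbers will not match the reported ρ:

```r
set.seed(3)
n <- 500
# Made-up data: installs tend to be lower the longer an app has gone without an update.
days_since_update <- runif(n, min = 0, max = 3000)
log_installs <- 6 - 0.0005 * days_since_update + rnorm(n, sd = 1.2)

# exact = FALSE requests the large-sample approximation for the p-value.
spearman_test <- cor.test(days_since_update, log_installs,
                          method = "spearman", exact = FALSE)

print(spearman_test$estimate) # Spearman's rho
print(spearman_test$p.value)
```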
# Calculate Pearson correlation and perform the test
cor_test <- cor.test(data_clean$Size, data_clean$Installs, method = "pearson")
# Output the correlation coefficient and p-value
cor_test
##
## Pearson's product-moment correlation
##
## data: data_clean$Size and data_clean$Installs
## t = 4.0069, df = 9657, p-value = 6.198e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02081430 0.06063426
## sample estimates:
## cor
## 0.04074046
According to the relational hypothesis testing:
1. Correlation coefficient (cor): The Pearson correlation coefficient is 0.0407. This indicates a very weak positive relationship between Size and Installs: as app size increases, installs tend to increase slightly, but the effect is minimal.
2. P-value: The p-value is 6.198e-05 (0.00006198), much smaller than the conventional significance level of 0.05. We can therefore reject the null hypothesis of no correlation and conclude that Size and Installs are not independent.
3. Confidence interval: The 95% confidence interval for the correlation coefficient is 0.0208 to 0.0606. This range is narrow and close to zero, confirming that although the relationship is statistically significant, the correlation is very weak.
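The reported test statistic and interval can be checked by hand from the correlation coefficient and degrees of freedom in the output above. The sketch below uses the standard formulas (the t statistic for a Pearson correlation and the Fisher z-transformation for the interval); the variable names are illustrative:

```r
r  <- 0.04074046 # reported Pearson correlation
df <- 9657       # reported degrees of freedom (n - 2)
n  <- df + 2

# t statistic for H0: correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Share of variance in Installs linearly associated with Size
r_squared <- r^2

print(round(t_stat, 4))
print(round(ci, 4))
print(round(r_squared, 4))
```

This reproduces t ≈ 4.0069 and an interval of roughly 0.0208 to 0.0606, while r² ≈ 0.0017 shows that Size is linearly associated with under 0.2% of the variance in Installs.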
# Convert "Last Updated" to Date format and calculate days since a reference date
data_updated$Last.Updated <- as.Date(data_updated$Last.Updated, format = "%B %d, %Y")
reference_date <- as.Date("2024-01-01")
data_updated$Days.Since.Last.Update <- as.numeric(difftime(reference_date, data_updated$Last.Updated, units = "days"))
# Clean "Installs" to remove "+" and "," characters and convert to numeric
data_updated$Installs <- as.numeric(gsub("[+,]", "", data_updated$Installs))
# Ensure "Reviews" is numeric
data_updated$Reviews <- as.numeric(data_updated$Reviews)
# Encode "Content Rating" as a factor and then to numeric
data_updated$Content.Rating.Encoded <- as.numeric(as.factor(data_updated$Content.Rating))
# Select relevant columns for correlation calculation
correlation_data <- data_updated %>%
select(Days.Since.Last.Update, Content.Rating.Encoded, Rating, Reviews, Installs)
# Calculate correlations for specific columns
selected_correlations <- cor(correlation_data, use = "complete.obs")[c("Days.Since.Last.Update", "Content.Rating.Encoded"), c("Rating", "Reviews", "Installs")]
# Print the selected correlations
print(selected_correlations)
## Rating Reviews Installs
## Days.Since.Last.Update -0.12111772 -0.06576669 -0.07787303
## Content.Rating.Encoded 0.02591249 0.05562098 0.04980714
Implications: These findings suggest that regular updates are associated with higher installs and that installs differ across content ratings. The linear correlations in the table above are weak, however (|r| < 0.13), so strategies built on timely updates and content-rating choices should be treated as hypotheses to test rather than established drivers of app performance and user acquisition.
# Check for missing values and ensure no negative/zero values in log_Installs
#data_final <- data_final %>%
#filter(!is.na(Installs), Installs > 0) # Remove missing values and zeros in Installs
# Apply log transformation, adding 1 to avoid log(0)
#data_final$log_Installs <- log(data_final$Installs + 1)
# Ensure Price_Category has no missing values
#data_final <- data_final %>%
#filter(!is.na(Price_Category))
#Perform t-test on log-transformed Installs by Price Category
#t_test_result <- t.test(log_Installs ~ Price_Category, data = data_final, var.equal = FALSE)
#Print t-test results
#print(t_test_result)
There is a statistically significant difference between the number of installs for “Free” and “Paid” apps, with the p-value being extremely small.
From this analysis we can state that free apps are, on average, more popular than paid apps, which matches what is generally observed in the app market.
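Since the t-test code above is commented out, here is a runnable sketch of the same kind of comparison. It assumes a two-level `Price_Category` and uses simulated install counts, so its output is not the result discussed above:

```r
set.seed(4)
# Made-up install counts; free apps are given a higher typical level.
toy <- data.frame(
  Price_Category = rep(c("Free", "Paid"), times = c(900, 100)),
  Installs = round(10^c(rnorm(900, mean = 5, sd = 1.5),
                        rnorm(100, mean = 3.5, sd = 1.5)))
)

# log1p(x) = log(x + 1), so zero installs do not produce -Inf.
toy$log_Installs <- log1p(toy$Installs)

# var.equal = FALSE gives the Welch test, which does not assume equal group variances.
t_test_result <- t.test(log_Installs ~ Price_Category, data = toy, var.equal = FALSE)
print(t_test_result)
```

The log transformation matters here because raw install counts are extremely right-skewed, and a handful of very large apps would otherwise dominate the group means.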
#Confirming with a t-test
# Perform t-test for Reviews between Free and Paid
t_test_reviews <- t.test(Reviews ~ Price_Category, data = data_updated)
# Perform t-test for Rating between Free and Paid
t_test_rating <- t.test(Rating ~ Price_Category, data = data_updated)
# Print the results
print(t_test_reviews)
##
## Welch Two Sample t-test
##
## data: Reviews by Price_Category
## t = 11.019, df = 9299.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Free and group Paid is not equal to 0
## 95 percent confidence interval:
## 185401.3 265636.3
## sample estimates:
## mean in group Free mean in group Paid
## 234243.689 8724.888
print(t_test_rating)
##
## Welch Two Sample t-test
##
## data: Rating by Price_Category
## t = -3.9443, df = 883.57, p-value = 8.638e-05
## alternative hypothesis: true difference in means between group Free and group Paid is not equal to 0
## 95 percent confidence interval:
## -0.1121028 -0.0376075
## sample estimates:
## mean in group Free mean in group Paid
## 4.167384 4.242239
There is a statistically significant difference between the mean number of reviews for Free and Paid apps. Free apps have significantly more reviews on average.
There is a statistically significant difference between the mean ratings for Free and Paid apps. Paid apps have slightly higher ratings on average, though the difference is small.
The tests below check whether different review categories have different average ratings.
anova_result <- aov(Rating ~ as.factor(Review_Category), data = data_clean)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Review_Category) 11 106.3 9.662 41.36 <2e-16 ***
## Residuals 9647 2253.6 0.234
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is significant (p < 2e-16), so we can conclude that the average rating is not the same across all review categories.
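Statistical significance alone does not say how large the effect is. As a rough check, eta squared can be computed directly from the sums of squares printed in the ANOVA table above (the variable names are illustrative):

```r
ss_between <- 106.3  # Sum Sq for as.factor(Review_Category)
ss_within  <- 2253.6 # Sum Sq for Residuals
df_between <- 11
df_within  <- 9647

# Eta squared: share of the total variance in Rating explained by Review_Category
eta_squared <- ss_between / (ss_between + ss_within)

# F statistic recomputed from the rounded sums of squares
f_stat <- (ss_between / df_between) / (ss_within / df_within)

print(round(eta_squared, 3))
print(round(f_stat, 1))
```

With these values eta squared is about 0.045, so review category accounts for roughly 4.5% of the variance in Rating. The recomputed F differs slightly from the printed 41.36 only because the sums of squares in the table are rounded.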
# Perform Tukey's HSD
tukey_result <- TukeyHSD(anova_result)
tukey_result
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Rating ~ as.factor(Review_Category), data = data_clean)
##
## $`as.factor(Review_Category)`
## diff lwr upr p adj
## 100+-0+ -0.096683215 -0.152307271 -0.04105916 0.0000009
## 500+-0+ -0.063032835 -0.141474646 0.01540898 0.2646281
## 1K+-0+ -0.019190832 -0.089971134 0.05158947 0.9992526
## 2.5K+-0+ 0.003350463 -0.074143085 0.08084401 1.0000000
## 5K+-0+ 0.064918154 -0.012646893 0.14248320 0.2087515
## 10K+-0+ 0.095614797 0.030638525 0.16059107 0.0000973
## 25K+-0+ 0.105627098 0.035846939 0.17540726 0.0000488
## 50K+-0+ 0.167554014 0.091642554 0.24346547 0.0000000
## 100K+-0+ 0.203608898 0.135724795 0.27149300 0.0000000
## 300K+-0+ 0.249388670 0.170111342 0.32866600 0.0000000
## 1M+-0+ 0.300139945 0.211244127 0.38903576 0.0000000
## 500+-100+ 0.033650380 -0.054364565 0.12166533 0.9848292
## 1K+-100+ 0.077492383 -0.003768703 0.15875347 0.0784345
## 2.5K+-100+ 0.100033678 0.012862795 0.18720456 0.0096675
## 5K+-100+ 0.161601369 0.074366918 0.24883582 0.0000001
## 10K+-100+ 0.192298012 0.116039053 0.26855697 0.0000000
## 25K+-100+ 0.202310313 0.121918874 0.28270175 0.0000000
## 50K+-100+ 0.264237229 0.178469737 0.35000472 0.0000000
## 100K+-100+ 0.300292113 0.221540831 0.37904339 0.0000000
## 300K+-100+ 0.346071885 0.257311491 0.43483228 0.0000000
## 1M+-100+ 0.396823160 0.299375844 0.49427048 0.0000000
## 1K+-500+ 0.043842003 -0.054455739 0.14213974 0.9515761
## 2.5K+-500+ 0.066383298 -0.036853541 0.16962014 0.6214468
## 5K+-500+ 0.127950989 0.024660470 0.23124151 0.0030189
## 10K+-500+ 0.158647632 0.064443010 0.25285225 0.0000025
## 25K+-500+ 0.168659933 0.071079887 0.26623998 0.0000011
## 50K+-500+ 0.230586849 0.128532233 0.33264146 0.0000000
## 100K+-500+ 0.266641733 0.170408442 0.36287502 0.0000000
## 300K+-500+ 0.312421505 0.207839051 0.41700396 0.0000000
## 1M+-500+ 0.363172780 0.251123410 0.47522215 0.0000000
## 2.5K+-1K+ 0.022541295 -0.075001405 0.12008400 0.9998394
## 5K+-1K+ 0.084108986 -0.013490527 0.18170850 0.1727899
## 10K+-1K+ 0.114805629 0.026878134 0.20273312 0.0012014
## 25K+-1K+ 0.124817930 0.033283243 0.21635262 0.0005180
## 50K+-1K+ 0.186744846 0.090454254 0.28303544 0.0000000
## 100K+-1K+ 0.222799730 0.132702117 0.31289734 0.0000000
## 300K+-1K+ 0.268579502 0.169613735 0.36754527 0.0000000
## 1M+-1K+ 0.319330777 0.212504774 0.42615678 0.0000000
## 5K+-2.5K+ 0.061567691 -0.041004546 0.16413993 0.7193424
## 10K+-2.5K+ 0.092264334 -0.001152170 0.18568084 0.0565429
## 25K+-2.5K+ 0.102276635 0.005457227 0.19909604 0.0276896
## 50K+-2.5K+ 0.164203551 0.062875978 0.26553112 0.0000078
## 100K+-2.5K+ 0.200258435 0.104796512 0.29572036 0.0000000
## 300K+-2.5K+ 0.246038206 0.142165102 0.34991131 0.0000000
## 1M+-2.5K+ 0.296789482 0.185401898 0.40817707 0.0000000
## 10K+-5K+ 0.030696643 -0.062779181 0.12417247 0.9957463
## 25K+-5K+ 0.040708944 -0.056167701 0.13758559 0.9685508
## 50K+-5K+ 0.102635860 0.001253596 0.20401812 0.0440982
## 100K+-5K+ 0.138690744 0.043170771 0.23421072 0.0001331
## 300K+-5K+ 0.184470516 0.080544059 0.28839697 0.0000004
## 1M+-5K+ 0.235221791 0.123784453 0.34665913 0.0000000
## 25K+-10K+ 0.010012302 -0.077112114 0.09713672 0.9999999
## 50K+-10K+ 0.071939217 -0.020169104 0.16404754 0.3070668
## 100K+-10K+ 0.107994101 0.022380758 0.19360745 0.0022235
## 300K+-10K+ 0.153773873 0.058872409 0.24867534 0.0000078
## 1M+-10K+ 0.204525148 0.101453039 0.30759726 0.0000000
## 50K+-25K+ 0.061926916 -0.033630908 0.15748474 0.6094814
## 100K+-25K+ 0.097981800 0.008667751 0.18729585 0.0175649
## 300K+-25K+ 0.143761571 0.045508620 0.24201452 0.0001113
## 1M+-25K+ 0.194512847 0.088346871 0.30067882 0.0000001
## 100K+-50K+ 0.036054884 -0.058127272 0.13023704 0.9846717
## 300K+-50K+ 0.081834656 -0.020863551 0.18453286 0.2768896
## 1M+-50K+ 0.132585931 0.022293168 0.24287869 0.0048805
## 300K+-100K+ 0.045779772 -0.051135776 0.14269532 0.9282456
## 1M+-100K+ 0.096531047 -0.008398431 0.20146052 0.1064662
## 1M+-300K+ 0.050751275 -0.061884591 0.16338714 0.9479902
# Convert the result to a data frame
tukey_df <- as.data.frame(tukey_result$`as.factor(Review_Category)`)
# Filter for significant p-values
significant_tukey <- tukey_df[tukey_df[["p adj"]] < 0.05, ]
# Display the significant results
print(significant_tukey)
## diff lwr upr p adj
## 100+-0+ -0.09668322 -0.152307271 -0.04105916 8.987756e-07
## 10K+-0+ 0.09561480 0.030638525 0.16059107 9.732720e-05
## 25K+-0+ 0.10562710 0.035846939 0.17540726 4.884843e-05
## 50K+-0+ 0.16755401 0.091642554 0.24346547 0.000000e+00
## 100K+-0+ 0.20360890 0.135724795 0.27149300 0.000000e+00
## 300K+-0+ 0.24938867 0.170111342 0.32866600 0.000000e+00
## 1M+-0+ 0.30013994 0.211244127 0.38903576 0.000000e+00
## 2.5K+-100+ 0.10003368 0.012862795 0.18720456 9.667490e-03
## 5K+-100+ 0.16160137 0.074366918 0.24883582 9.538328e-08
## 10K+-100+ 0.19229801 0.116039053 0.26855697 0.000000e+00
## 25K+-100+ 0.20231031 0.121918874 0.28270175 0.000000e+00
## 50K+-100+ 0.26423723 0.178469737 0.35000472 0.000000e+00
## 100K+-100+ 0.30029211 0.221540831 0.37904339 0.000000e+00
## 300K+-100+ 0.34607188 0.257311491 0.43483228 0.000000e+00
## 1M+-100+ 0.39682316 0.299375844 0.49427048 0.000000e+00
## 5K+-500+ 0.12795099 0.024660470 0.23124151 3.018884e-03
## 10K+-500+ 0.15864763 0.064443010 0.25285225 2.473396e-06
## 25K+-500+ 0.16865993 0.071079887 0.26623998 1.080775e-06
## 50K+-500+ 0.23058685 0.128532233 0.33264146 0.000000e+00
## 100K+-500+ 0.26664173 0.170408442 0.36287502 0.000000e+00
## 300K+-500+ 0.31242150 0.207839051 0.41700396 0.000000e+00
## 1M+-500+ 0.36317278 0.251123410 0.47522215 0.000000e+00
## 10K+-1K+ 0.11480563 0.026878134 0.20273312 1.201416e-03
## 25K+-1K+ 0.12481793 0.033283243 0.21635262 5.179950e-04
## 50K+-1K+ 0.18674485 0.090454254 0.28303544 1.572425e-08
## 100K+-1K+ 0.22279973 0.132702117 0.31289734 0.000000e+00
## 300K+-1K+ 0.26857950 0.169613735 0.36754527 0.000000e+00
## 1M+-1K+ 0.31933078 0.212504774 0.42615678 0.000000e+00
## 25K+-2.5K+ 0.10227664 0.005457227 0.19909604 2.768961e-02
## 50K+-2.5K+ 0.16420355 0.062875978 0.26553112 7.808701e-06
## 100K+-2.5K+ 0.20025843 0.104796512 0.29572036 3.507881e-10
## 300K+-2.5K+ 0.24603821 0.142165102 0.34991131 0.000000e+00
## 1M+-2.5K+ 0.29678948 0.185401898 0.40817707 0.000000e+00
## 50K+-5K+ 0.10263586 0.001253596 0.20401812 4.409823e-02
## 100K+-5K+ 0.13869074 0.043170771 0.23421072 1.331239e-04
## 300K+-5K+ 0.18447052 0.080544059 0.28839697 4.428778e-07
## 1M+-5K+ 0.23522179 0.123784453 0.34665913 2.244944e-10
## 100K+-10K+ 0.10799410 0.022380758 0.19360745 2.223466e-03
## 300K+-10K+ 0.15377387 0.058872409 0.24867534 7.832139e-06
## 1M+-10K+ 0.20452515 0.101453039 0.30759726 5.942656e-09
## 100K+-25K+ 0.09798180 0.008667751 0.18729585 1.756493e-02
## 300K+-25K+ 0.14376157 0.045508620 0.24201452 1.113055e-04
## 1M+-25K+ 0.19451285 0.088346871 0.30067882 1.436204e-07
## 1M+-50K+ 0.13258593 0.022293168 0.24287869 4.880458e-03
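As expected, from the 100+ category upward the average rating rises with review volume. Most pairs involving the higher review categories differ significantly, and the largest gaps are between 1M+ and the lowest categories (0.40 versus 100+, 0.36 versus 500+, and 0.30 versus 0+).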
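With this many pairwise comparisons, sorting the significant pairs by effect size makes the pattern easier to read. A small sketch on simulated ratings (the four review bands and their means are made up; only `aov` and `TukeyHSD` mirror the analysis above):

```r
set.seed(5)
# Made-up ratings for four hypothetical review bands; higher bands get higher means.
bands <- c("0+", "1K+", "100K+", "1M+")
toy <- data.frame(
  Review_Category = factor(rep(bands, each = 200), levels = bands),
  Rating = c(rnorm(200, 3.9, 0.4), rnorm(200, 4.1, 0.4),
             rnorm(200, 4.3, 0.4), rnorm(200, 4.5, 0.4))
)

fit <- aov(Rating ~ Review_Category, data = toy)
tukey_df <- as.data.frame(TukeyHSD(fit)$Review_Category)

# Keep significant pairs and sort them by the size of the mean difference.
significant <- tukey_df[tukey_df[["p adj"]] < 0.05, ]
significant <- significant[order(-abs(significant$diff)), ]
print(significant)
```

Sorted this way, the pair with the largest gap appears first, which makes it straightforward to see which categories drive the overall ANOVA result.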
To make Ratings and Reviews easier to compare against Installs, we can group Installs into categories.
# 1. Encode content rating (e.g., as factor levels or one-hot encoding)
data_updated$Content.Rating <- as.factor(data_updated$Content.Rating)
data_updated <- data_updated %>%
filter(!is.na(Installs) & Installs > 0)
# ANOVA test for difference in installs between content ratings
install_anova <- aov(log10(Installs) ~ Content.Rating, data = data_updated)
print("\nANOVA test results for Installs by Content Rating:")
## [1] "\nANOVA test results for Installs by Content Rating:"
print(summary(install_anova))
## Df Sum Sq Mean Sq F value Pr(>F)
## Content.Rating 5 743 148.68 41.95 <2e-16 ***
## Residuals 9638 34160 3.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA analysis: The test reveals significant differences in log10 install counts across content ratings (F(5, 9638) = 41.95, p < 2e-16). The effect is small, however: content rating accounts for about 2% of the variance in log10(Installs) (743 / (743 + 34160) ≈ 0.021), so content rating is a statistically detectable but minor factor in attracting users.
# Convert the 'last_updated' column to Date type
data_updated$last_updated <- as.Date(data_updated$last_updated, format = "%B %d, %Y")
# Calculate the difference in days between the maximum date and each date in 'last_updated'
data_updated$lastupdate <- as.numeric(difftime(max(data_updated$last_updated, na.rm = TRUE),
data_updated$last_updated,
units = "days"))
data_updated$last_updated <- NULL
data_updated <- data_updated[, !(names(data_updated) %in% c(
"Last.Updated", "Android.Ver", "last_updated", "current_date",
"days_since_update", "update_month", "season",
"Days.Since.Last.Update", "Content.Rating.Encoded","App"))]
# Rename a column
names(data_updated)[names(data_updated) == "Content.Rating"] <- "content_rating"
data_updated$content_rating <- as.numeric(data_updated$content_rating)
str(data_updated)
## 'data.frame': 9644 obs. of 8 variables:
## $ Category : Factor w/ 33 levels "ART_AND_DESIGN",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : num 159 967 87510 215644 967 ...
## $ Size : num 19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
## $ Installs : num 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ content_rating: num 2 2 2 5 2 2 2 2 2 2 ...
## $ lastupdate : num 213 205 7 61 49 500 104 55 322 36 ...
# 2. Create dummy variables (one-hot encoding) for Category
category_dummies <- model.matrix(~ Category - 1, data = data_updated)
colnames(category_dummies) <- gsub("Category", "cat", colnames(category_dummies))
# 3. Add dummy variables to the dataset and remove the original 'Category' column
data_updated <- cbind(data_updated, category_dummies)
data_updated$Category <- NULL
# 4. Replace spaces in column names with underscores
colnames(data_updated) <- gsub(" ", "_", colnames(data_updated))
# View the processed data
head(data_updated)
## Rating Reviews Size Installs Price content_rating lastupdate
## 1 4.1 159 19.0 1e+04 0 2 213
## 2 3.9 967 14.0 5e+05 0 2 205
## 3 4.7 87510 8.7 5e+06 0 2 7
## 4 4.5 215644 25.0 5e+07 0 5 61
## 5 4.3 967 2.8 1e+05 0 2 49
## 6 4.4 167 5.6 5e+04 0 2 500
## catART_AND_DESIGN catAUTO_AND_VEHICLES catBEAUTY catBOOKS_AND_REFERENCE
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 1 0 0 0
## 6 1 0 0 0
## catBUSINESS catCOMICS catCOMMUNICATION catDATING catEDUCATION
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## catENTERTAINMENT catEVENTS catFAMILY catFINANCE catFOOD_AND_DRINK catGAME
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## catHEALTH_AND_FITNESS catHOUSE_AND_HOME catLIBRARIES_AND_DEMO catLIFESTYLE
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## catMAPS_AND_NAVIGATION catMEDICAL catNEWS_AND_MAGAZINES catPARENTING
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## catPERSONALIZATION catPHOTOGRAPHY catPRODUCTIVITY catSHOPPING catSOCIAL
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## catSPORTS catTOOLS catTRAVEL_AND_LOCAL catVIDEO_PLAYERS catWEATHER
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
# Load necessary libraries
library(ggplot2)
# Create two categories: Low Installs and High Installs
# Calculate the median of Installs to split into two categories
median_installs <- median(data_updated$Installs, na.rm = TRUE)
# Reclassify into two categories
data_updated$Installs_Category <- ifelse(data_updated$Installs <= median_installs, "Low Installs", "High Installs")
# Convert 'Installs_Category' to factor with levels "Low Installs" and "High Installs"
data_updated$Installs_Category <- factor(data_updated$Installs_Category,
levels = c("Low Installs", "High Installs"),
labels = c(0, 1))
# Check the conversion
table(data_updated$Installs_Category)
##
## 0 1
## 5744 3900
# Create a histogram for the new categories
ggplot(data_updated, aes(x = Installs_Category)) +
geom_bar(stat = "count", fill = "skyblue", color = "black") +
labs(title = "Histogram of Installs Category (Low vs High)",
x = "Installs Category",
y = "Count") +
theme_minimal()
str(data_updated)
## 'data.frame': 9644 obs. of 41 variables:
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : num 159 967 87510 215644 967 ...
## $ Size : num 19 14 8.7 25 2.8 5.6 19 29 33 3.1 ...
## $ Installs : num 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ content_rating : num 2 2 2 5 2 2 2 2 2 2 ...
## $ lastupdate : num 213 205 7 61 49 500 104 55 322 36 ...
## $ catART_AND_DESIGN : num 1 1 1 1 1 1 1 1 1 1 ...
## $ catAUTO_AND_VEHICLES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catBEAUTY : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catBOOKS_AND_REFERENCE: num 0 0 0 0 0 0 0 0 0 0 ...
## $ catBUSINESS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catCOMICS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catCOMMUNICATION : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catDATING : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catEDUCATION : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catENTERTAINMENT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catEVENTS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catFAMILY : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catFINANCE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catFOOD_AND_DRINK : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catGAME : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catHEALTH_AND_FITNESS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catHOUSE_AND_HOME : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catLIBRARIES_AND_DEMO : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catLIFESTYLE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catMAPS_AND_NAVIGATION: num 0 0 0 0 0 0 0 0 0 0 ...
## $ catMEDICAL : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catNEWS_AND_MAGAZINES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catPARENTING : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catPERSONALIZATION : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catPHOTOGRAPHY : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catPRODUCTIVITY : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catSHOPPING : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catSOCIAL : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catSPORTS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catTOOLS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catTRAVEL_AND_LOCAL : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catVIDEO_PLAYERS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ catWEATHER : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Installs_Category : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 1 ...
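The median split above is sensitive to ties: install counts are reported in coarse buckets, so many apps sit exactly on the median, which is why the two classes are not equal in size (5,744 vs. 3,900). The sketch below uses made-up counts to show how the choice between `<=` and `>=` moves all tied apps from one class to the other:

```r
# Made-up install counts in Play Store style buckets (1e3, 1e4, 1e5, ...).
installs <- rep(c(1e3, 1e4, 1e5, 1e6, 1e7), times = c(15, 25, 20, 25, 15))
med <- median(installs)

# Rule A: apps at the median count as "low" (the <= rule used above).
high_a <- as.integer(installs > med)
# Rule B: apps at the median count as "high".
high_b <- as.integer(installs >= med)

print(table(rule_a = high_a))
print(table(rule_b = high_b))
# Number of apps whose label depends on which rule is chosen
print(sum(installs == med))
```

Whichever rule is chosen, it should be applied consistently wherever the binary target is defined, since the tied apps change class between the two rules.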
The final dataset after preprocessing is named ‘data_updated’. We have selected five modelling techniques that are best suited to answering our SMART question.
# ------------------------------------------------------------------------
# 2. Feature Engineering ------------------------------------------------
# Create binary target variable: High success (Installs >= median) vs. Low success
median_installs <- median(data_updated$Installs, na.rm = TRUE)
data_updated$Success <- ifelse(data_updated$Installs >= median_installs, 1, 0)
# Convert 'Category' to dummy variables (one-hot encoding)
# data_updated <- data_updated %>%
# mutate(Category = as.factor(Category)) %>%
# cbind(model.matrix(~ Category - 1, data_updated))
# Drop unused columns
#data_updated <- data_updated %>% select(-Android.Ver, -Content.Rating, -Last.Updated)
# 3. Data Splitting -----------------------------------------------------
# Separate features (X) and target (y)
#X <- data_final %>% select(-Installs, -Success)
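# Caution: Installs_Category (created above from a median split of Installs) is still in X
# and leaks the target; drop it with select(-Installs_Category) for an unbiased evaluation.
X <- data_updated %>% select(-Installs, -Success) # Exclude the target variable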
y <- data_updated$Success # Extract the target variable
table(y)
## y
## 0 1
## 4632 5012
#Split into training and testing sets
set.seed(123)
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
# Define X_train, X_test, y_train, y_test
X_train <- X[train_index, ] %>% mutate(across(everything(), as.numeric))
X_test <- X[-train_index, ] %>% mutate(across(everything(), as.numeric))
y_train <- y[train_index]
y_test <- y[-train_index]
# Check the structure
# Convert data to matrix for XGBoost
dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
dtest <- xgb.DMatrix(data = as.matrix(X_test), label = y_test)
# 4. Train Gradient Boosting Model --------------------------------------
params <- list(
objective = "binary:logistic", # Binary classification
eval_metric = "logloss",
max_depth = 6,
eta = 0.1,
subsample = 0.8,
colsample_bytree = 0.8
)
# Train the model
set.seed(42)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 100,
watchlist = list(train = dtrain, test = dtest),
early_stopping_rounds = 10,
verbose = 1
)
## [1] train-logloss:0.609961 test-logloss:0.610901
## Multiple eval metrics are present. Will use test_logloss for early stopping.
## Will train until test_logloss hasn't improved in 10 rounds.
##
## [2] train-logloss:0.542076 test-logloss:0.544289
## [3] train-logloss:0.485184 test-logloss:0.488545
## [4] train-logloss:0.436916 test-logloss:0.441353
## [5] train-logloss:0.405047 test-logloss:0.409678
## [6] train-logloss:0.368169 test-logloss:0.373890
## [7] train-logloss:0.344374 test-logloss:0.350321
## [8] train-logloss:0.315461 test-logloss:0.323088
## [9] train-logloss:0.297176 test-logloss:0.304968
## [10] train-logloss:0.273833 test-logloss:0.282865
## [11] train-logloss:0.252967 test-logloss:0.263302
## [12] train-logloss:0.234827 test-logloss:0.245916
## [13] train-logloss:0.218308 test-logloss:0.231012
## [14] train-logloss:0.203750 test-logloss:0.217901
## [15] train-logloss:0.191103 test-logloss:0.206367
## [16] train-logloss:0.179810 test-logloss:0.195884
## [17] train-logloss:0.169687 test-logloss:0.186802
## [18] train-logloss:0.163163 test-logloss:0.180602
## [19] train-logloss:0.154717 test-logloss:0.172802
## [20] train-logloss:0.150294 test-logloss:0.168185
## [21] train-logloss:0.143017 test-logloss:0.161509
## [22] train-logloss:0.136578 test-logloss:0.156003
## [23] train-logloss:0.130758 test-logloss:0.150831
## [24] train-logloss:0.125531 test-logloss:0.146289
## [25] train-logloss:0.120791 test-logloss:0.142693
## [26] train-logloss:0.118051 test-logloss:0.139899
## [27] train-logloss:0.113890 test-logloss:0.136532
## [28] train-logloss:0.110343 test-logloss:0.133553
## [29] train-logloss:0.107051 test-logloss:0.130759
## [30] train-logloss:0.104148 test-logloss:0.128428
## [31] train-logloss:0.101383 test-logloss:0.126409
## [32] train-logloss:0.099031 test-logloss:0.124483
## [33] train-logloss:0.097492 test-logloss:0.123358
## [34] train-logloss:0.095159 test-logloss:0.121917
## [35] train-logloss:0.094473 test-logloss:0.121547
## [36] train-logloss:0.092616 test-logloss:0.120371
## [37] train-logloss:0.090554 test-logloss:0.119301
## [38] train-logloss:0.088817 test-logloss:0.118244
## [39] train-logloss:0.087445 test-logloss:0.117405
## [40] train-logloss:0.086014 test-logloss:0.116530
## [41] train-logloss:0.084754 test-logloss:0.115580
## [42] train-logloss:0.083631 test-logloss:0.114904
## [43] train-logloss:0.082379 test-logloss:0.114289
## [44] train-logloss:0.081223 test-logloss:0.113716
## [45] train-logloss:0.080282 test-logloss:0.113092
## [46] train-logloss:0.079407 test-logloss:0.112634
## [47] train-logloss:0.078634 test-logloss:0.112557
## [48] train-logloss:0.077629 test-logloss:0.112271
## [49] train-logloss:0.076872 test-logloss:0.112110
## [50] train-logloss:0.076057 test-logloss:0.111826
## [51] train-logloss:0.075172 test-logloss:0.111830
## [52] train-logloss:0.074348 test-logloss:0.111555
## [53] train-logloss:0.073854 test-logloss:0.111476
## [54] train-logloss:0.073207 test-logloss:0.111389
## [55] train-logloss:0.072740 test-logloss:0.111280
## [56] train-logloss:0.071966 test-logloss:0.111071
## [57] train-logloss:0.071495 test-logloss:0.110951
## [58] train-logloss:0.070955 test-logloss:0.110909
## [59] train-logloss:0.070698 test-logloss:0.110868
## [60] train-logloss:0.070346 test-logloss:0.110629
## [61] train-logloss:0.070082 test-logloss:0.110552
## [62] train-logloss:0.069467 test-logloss:0.110589
## [63] train-logloss:0.069110 test-logloss:0.110605
## [64] train-logloss:0.068891 test-logloss:0.110235
## [65] train-logloss:0.068188 test-logloss:0.109897
## [66] train-logloss:0.067574 test-logloss:0.110155
## [67] train-logloss:0.067239 test-logloss:0.110109
## [68] train-logloss:0.067093 test-logloss:0.109921
## [69] train-logloss:0.066588 test-logloss:0.110098
## [70] train-logloss:0.066082 test-logloss:0.110419
## [71] train-logloss:0.065747 test-logloss:0.110524
## [72] train-logloss:0.065411 test-logloss:0.110210
## [73] train-logloss:0.065194 test-logloss:0.110247
## [74] train-logloss:0.064744 test-logloss:0.110331
## [75] train-logloss:0.064438 test-logloss:0.110252
## Stopping. Best iteration:
## [65] train-logloss:0.068188 test-logloss:0.109897
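The early-stopping rule reported in the log can be reproduced by hand. The sketch below is a simplified illustration, not xgboost's internal code: `find_stop` is a hypothetical helper, and only the test log-loss values for rounds 60 to 75 are copied from the log, so earlier rounds (all of which have a higher loss) are ignored.

```r
# Test log-loss for rounds 60 to 75, copied from the training log above.
test_logloss <- c(0.110629, 0.110552, 0.110589, 0.110605, 0.110235, 0.109897,
                  0.110155, 0.110109, 0.109921, 0.110098, 0.110419, 0.110524,
                  0.110210, 0.110247, 0.110331, 0.110252)
first_round <- 60
patience <- 10  # mirrors early_stopping_rounds = 10

# Hypothetical helper: stop once `patience` rounds have passed
# without beating the best loss seen so far.
find_stop <- function(losses, patience) {
  best <- Inf
  best_idx <- NA_integer_
  for (i in seq_along(losses)) {
    if (losses[i] < best) {
      best <- losses[i]
      best_idx <- i
    } else if (i - best_idx >= patience) {
      return(list(best_idx = best_idx, stop_idx = i))
    }
  }
  list(best_idx = best_idx, stop_idx = length(losses))
}

res <- find_stop(test_logloss, patience)
best_round <- first_round + res$best_idx - 1
stop_round <- first_round + res$stop_idx - 1
cat("Best round:", best_round, "| stopped at round:", stop_round, "\n")
```

Because the test set doubles as the early-stopping watchlist here, the test metrics reported later may be slightly optimistic; a separate validation split would avoid that.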
# 5. Model Evaluation ---------------------------------------------------
# Make predictions
y_pred <- predict(xgb_model, dtest)
y_pred_class <- ifelse(y_pred > 0.5, 1, 0)
# Confusion Matrix
y_pred_class <- factor(y_pred_class, levels = c(0, 1))
y_test <- factor(y_test, levels = c(0, 1))
conf_matrix <- confusionMatrix(y_pred_class, y_test)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1331 66
## 1 61 1435
##
## Accuracy : 0.9561
## 95% CI : (0.948, 0.9633)
## No Information Rate : 0.5188
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9121
##
## Mcnemar's Test P-Value : 0.7226
##
## Sensitivity : 0.9562
## Specificity : 0.9560
## Pos Pred Value : 0.9528
## Neg Pred Value : 0.9592
## Prevalence : 0.4812
## Detection Rate : 0.4601
## Detection Prevalence : 0.4829
## Balanced Accuracy : 0.9561
##
## 'Positive' Class : 0
##
# AUC and ROC Curve
roc_obj <- roc(as.numeric(as.character(y_test)), y_pred)
auc_value <- auc(roc_obj)
cat("AUC:", auc_value, "\n")
## AUC: 0.9914303
# Plot ROC Curve
plot(roc_obj, main = "ROC Curve", col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red")
# 6. Feature Importance -------------------------------------------------
importance_matrix <- xgb.importance(feature_names = colnames(X_train), model = xgb_model)
xgb.plot.importance(importance_matrix, top_n = 10, main = "Feature Importance")
# 7. Save Model ---------------------------------------------------------
xgb.save(xgb_model, "xgb_app_success.model")
## [1] TRUE
# Summary
cat("Gradient Boosting achieved an accuracy of", conf_matrix$overall["Accuracy"],
"and AUC of", auc_value, "\n")
## Gradient Boosting achieved an accuracy of 0.9561009 and AUC of 0.9914303
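As a sanity check, the headline statistics can be recomputed directly from the four counts in the confusion matrix. This is a minimal base R sketch; the counts are copied from the output above, and class 0 is treated as the positive class, as caret reports.

```r
# Counts copied from the XGBoost confusion matrix printed above
# (rows = predicted class, columns = reference class).
cm <- matrix(c(1331, 61, 66, 1435), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # recall for the positive class (0)
specificity <- cm["1", "1"] / sum(cm[, "1"])  # recall for the other class (1)
ppv         <- cm["0", "0"] / sum(cm["0", ])  # positive predictive value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv), 4))
```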
# Remove the raw Installs column (Installs_Category stays as the target)
data <- data_updated[, !colnames(data_updated) %in% c("Installs")]
# Split the data into training and testing sets
set.seed(123) # Ensure reproducibility
trainIndex <- createDataPartition(data$Installs_Category, p = 0.8, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]
# Load necessary libraries
library(rpart)
library(rpart.plot)
# Fit the decision tree model
set.seed(42)
tree_model <- rpart(
Installs_Category ~ . ,
data = trainData,
method = "class"
)
# Plot the decision tree
rpart.plot(tree_model, main = "Decision Tree for Predicting Installs Category")
# Predict on training and test datasets
train_predictions <- predict(tree_model, trainData, type = "class")
test_predictions <- predict(tree_model, testData, type = "class")
# Calculate accuracy
train_accuracy <- sum(train_predictions == trainData$Installs_Category) / nrow(trainData)
test_accuracy <- sum(test_predictions == testData$Installs_Category) / nrow(testData)
# Print accuracy results
cat("Training Accuracy: ", train_accuracy, "\n")
## Training Accuracy: 0.9479005
cat("Test Accuracy: ", test_accuracy, "\n")
## Test Accuracy: 0.9470954
# Check feature importance
importance <- tree_model$variable.importance
# Print feature importance
cat("Feature Importance:\n")
## Feature Importance:
print(importance)
## Reviews Success lastupdate Rating Size catGAME Price
## 2874.68475 1939.83505 537.88084 303.44420 276.97554 239.16318 85.62621
# Visualize feature importance (optional)
barplot(
importance,
main = "Feature Importance",
xlab = "Features",
ylab = "Importance",
col = "steelblue",
las = 2
)
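The importance scores above are built from how much each split improves node purity. rpart's default splitting criterion for classification is the Gini index, and the sketch below shows the underlying calculation on toy data. The `reviews` and `installs_cat` vectors and the thresholds are illustrative assumptions, not values from the Play Store dataset.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Toy data (illustrative only).
reviews      <- c(10, 50, 200, 5000, 12000, 90000)
installs_cat <- c(0, 0, 0, 1, 1, 1)

# Impurity decrease obtained by splitting one feature at a threshold.
split_gain <- function(x, y, threshold) {
  left  <- y[x < threshold]
  right <- y[x >= threshold]
  n <- length(y)
  gini(y) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
}

gain_good <- split_gain(reviews, installs_cat, 1000)  # separates the classes
gain_poor <- split_gain(reviews, installs_cat, 30)    # right branch stays mixed
cat("Gain at 1000:", gain_good, "| gain at 30:", gain_poor, "\n")
```

A feature that repeatedly offers large impurity decreases, as Reviews does in the fitted tree, accumulates a high importance score.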
Why shift to Random Forest?
High Dimensionality: With 41 variables, a random forest copes with many features better than a single tree and can identify the most important ones.
Feature Importance: Random forest provides a ranking of feature importance, helping us understand which variables influence the Installs_Category.
Accuracy: Random forest generally has better predictive accuracy than a single decision tree on larger and more complex datasets.
In this analysis, we employ a Random Forest model to predict the install category (Installs_Category) from the app attributes, including the app category indicators. The Random Forest algorithm is a robust ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.
The analysis will focus on:
Feature Selection: Using the app attributes (Rating, Reviews, Size, Price, content rating, last update) together with the app category indicators as predictors.
Model Objective: Accurately predict the install category by capturing the complex and nonlinear relationships between features using a Random Forest model.
Evaluation Metrics: Assess the model's performance using the out-of-bag (OOB) error rate, confusion matrices, accuracy, and kappa on the training and test sets, plus a visualization of feature importance so that the model's predictions are interpretable and actionable.
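The way an ensemble of trees combines its predictions can be sketched in a few lines of base R. The `votes` matrix below is toy data invented for illustration (three apps, five trees); a real forest would produce these votes from trees grown on bootstrap samples.

```r
# Toy votes (illustrative only): each row is one app, each column one tree.
votes <- matrix(c(1, 1, 0, 1, 0,
                  0, 0, 0, 1, 0,
                  1, 0, 1, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("app", 1:3), paste0("tree", 1:5)))

# The ensemble prediction is the class chosen by most trees.
majority_vote <- function(v) as.integer(names(which.max(table(v))))
forest_pred <- apply(votes, 1, majority_vote)

# Share of trees voting for class 1, usable as a rough class probability.
vote_share <- rowMeans(votes)

print(data.frame(forest_pred, vote_share))
```

Averaging over many de-correlated trees is what lets the forest reduce the variance of a single deep tree.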
library(randomForest)
library(caret)
# Train the random forest model
set.seed(123)
rf_model <- randomForest(Installs_Category ~ .,
data = trainData,
ntree = 500, # Number of trees
mtry = 10, # Number of predictors sampled at each split
importance = TRUE) # Enable importance calculation
# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = Installs_Category ~ ., data = trainData, ntree = 500, mtry = 10, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 4.77%
## Confusion matrix:
## 0 1 class.error
## 0 4418 178 0.03872933
## 1 190 2930 0.06089744
plot(rf_model)
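The OOB error rate and per-class errors in the model summary follow directly from the OOB confusion matrix. A minimal base R check, with the counts copied from the printed output (rows are the actual classes, columns the OOB predictions):

```r
# OOB confusion matrix counts copied from print(rf_model) above.
oob_cm <- matrix(c(4418, 190, 178, 2930), nrow = 2,
                 dimnames = list(Actual = c("0", "1"),
                                 Predicted = c("0", "1")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

# Overall OOB error: all misclassified apps over all apps.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

print(round(class_error, 4))
cat("OOB error rate:", round(100 * oob_error, 2), "%\n")
```

Since every tree is evaluated only on the apps left out of its bootstrap sample, this figure behaves like a built-in validation estimate.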
# Predictions on the training set
train_predictions <- predict(rf_model, trainData)
# Predictions on the testing set
test_predictions <- predict(rf_model, testData)
# Confusion Matrix for Training Data
train_cm <- confusionMatrix(train_predictions, trainData$Installs_Category)
print(train_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4585 2
## 1 11 3118
##
## Accuracy : 0.9983
## 95% CI : (0.9971, 0.9991)
## No Information Rate : 0.5956
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9965
##
## Mcnemar's Test P-Value : 0.0265
##
## Sensitivity : 0.9976
## Specificity : 0.9994
## Pos Pred Value : 0.9996
## Neg Pred Value : 0.9965
## Prevalence : 0.5956
## Detection Rate : 0.5942
## Detection Prevalence : 0.5945
## Balanced Accuracy : 0.9985
##
## 'Positive' Class : 0
##
# Confusion Matrix for Testing Data
test_cm <- confusionMatrix(test_predictions, testData$Installs_Category)
print(test_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1110 51
## 1 38 729
##
## Accuracy : 0.9538
## 95% CI : (0.9435, 0.9628)
## No Information Rate : 0.5954
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9039
##
## Mcnemar's Test P-Value : 0.2034
##
## Sensitivity : 0.9669
## Specificity : 0.9346
## Pos Pred Value : 0.9561
## Neg Pred Value : 0.9505
## Prevalence : 0.5954
## Detection Rate : 0.5757
## Detection Prevalence : 0.6022
## Balanced Accuracy : 0.9508
##
## 'Positive' Class : 0
##
# Calculate training and testing accuracy
train_accuracy <- sum(train_predictions == trainData$Installs_Category) / nrow(trainData)
test_accuracy <- sum(test_predictions == testData$Installs_Category) / nrow(testData)
# Prepare data for plotting
errors <- data.frame(
Trees = 1:500,
OOB = rf_model$err.rate[, 1],
TrainAccuracy = rep(train_accuracy, 500),
TestAccuracy = rep(test_accuracy, 500)
)
# Plot the errors
plot(errors$Trees, errors$OOB, type = "l", col = "blue", lwd = 2,
ylim = c(0, 1), xlab = "Number of Trees", ylab = "Error/Accuracy",
main = "Training, Testing, and OOB Error Rates")
lines(errors$Trees, 1 - errors$TrainAccuracy, col = "green", lwd = 2, lty = 2) # Training error
lines(errors$Trees, 1 - errors$TestAccuracy, col = "red", lwd = 2, lty = 2) # Testing error
legend("topright", legend = c("OOB Error", "Training Error", "Testing Error"),
col = c("blue", "green", "red"), lwd = 2, lty = c(1, 2, 2))
# Variable importance
importance(rf_model)
## 0 1 MeanDecreaseAccuracy
## Rating 12.3143267 16.6806973 19.98208469
## Reviews 29.1350290 151.8967955 77.30996319
## Size 6.6652651 9.7777886 12.02398142
## Price 46.9206523 57.2074894 64.43385934
## content_rating 10.2770840 1.5923851 8.21855263
## lastupdate 6.0252421 13.7498776 14.81094638
## catART_AND_DESIGN -4.6051233 -0.2165312 -3.45042719
## catAUTO_AND_VEHICLES -3.5269954 3.9297943 0.79451583
## catBEAUTY -1.8818210 0.7175801 -0.83122085
## catBOOKS_AND_REFERENCE 1.2987280 6.1796501 5.68559158
## catBUSINESS 1.3008149 4.2574561 4.61281955
## catCOMICS 2.5698036 -2.2443011 0.05459819
## catCOMMUNICATION 0.6471630 -0.4981754 0.06433939
## catDATING 0.1662498 -3.9717151 -3.24500605
## catEDUCATION 0.4332665 5.2348424 3.13959574
## catENTERTAINMENT 4.3583673 0.0749067 4.27366696
## catEVENTS 7.3509912 9.4533758 11.26804079
## catFAMILY -4.2677780 4.8104059 2.37953616
## catFINANCE 2.7344530 0.9457580 2.65559850
## catFOOD_AND_DRINK -4.3344813 0.1210452 -2.92672629
## catGAME 5.2343298 -2.3799702 2.87150712
## catHEALTH_AND_FITNESS -2.7017337 2.5330758 0.01158012
## catHOUSE_AND_HOME -1.8879437 -0.8949606 -1.89664750
## catLIBRARIES_AND_DEMO -5.1229906 2.7183140 -2.39524501
## catLIFESTYLE 2.8988634 -2.5279339 0.19344419
## catMAPS_AND_NAVIGATION -1.3880427 -1.3726584 -1.89691932
## catMEDICAL -1.7768118 14.9552421 14.41703332
## catNEWS_AND_MAGAZINES 2.7310481 4.3920915 5.32529587
## catPARENTING -5.3327553 4.0424688 -0.92325506
## catPERSONALIZATION -0.7005746 2.9871164 1.66920698
## catPHOTOGRAPHY 5.1940963 2.4619638 5.65700702
## catPRODUCTIVITY 3.0805891 -3.0220535 0.11810667
## catSHOPPING -0.7297752 -1.9448289 -1.85870472
## catSOCIAL -1.7231937 -0.8326130 -1.85841365
## catSPORTS 0.9142572 -1.1389045 -0.28319962
## catTOOLS 3.5425592 3.7780611 5.26140007
## catTRAVEL_AND_LOCAL 1.1628032 0.2542499 1.10365293
## catVIDEO_PLAYERS 1.0405432 -0.2697573 0.49874408
## catWEATHER -2.7008219 -3.4120292 -4.29987447
## Success 24.5198424 36.2257913 42.33695561
## MeanDecreaseGini
## Rating 132.4771859
## Reviews 1990.4086063
## Size 153.4647688
## Price 82.7671081
## content_rating 21.1893183
## lastupdate 179.7881521
## catART_AND_DESIGN 1.9802876
## catAUTO_AND_VEHICLES 3.8894313
## catBEAUTY 1.2238681
## catBOOKS_AND_REFERENCE 6.0756462
## catBUSINESS 4.2035011
## catCOMICS 2.2112898
## catCOMMUNICATION 0.9468554
## catDATING 2.8769271
## catEDUCATION 6.1706757
## catENTERTAINMENT 3.0470463
## catEVENTS 2.8896901
## catFAMILY 11.7899708
## catFINANCE 4.0657885
## catFOOD_AND_DRINK 2.4636973
## catGAME 13.8727408
## catHEALTH_AND_FITNESS 5.5084624
## catHOUSE_AND_HOME 3.0931222
## catLIBRARIES_AND_DEMO 1.8163560
## catLIFESTYLE 5.1587322
## catMAPS_AND_NAVIGATION 1.2967987
## catMEDICAL 12.6007062
## catNEWS_AND_MAGAZINES 3.0533446
## catPARENTING 3.0709891
## catPERSONALIZATION 3.9217580
## catPHOTOGRAPHY 6.9459726
## catPRODUCTIVITY 4.7614276
## catSHOPPING 2.3981296
## catSOCIAL 3.3124995
## catSPORTS 5.3446551
## catTOOLS 7.5616495
## catTRAVEL_AND_LOCAL 3.1566718
## catVIDEO_PLAYERS 3.1459915
## catWEATHER 2.8293681
## Success 919.6286663
# Plot variable importance
varImpPlot(rf_model)
#### Visualization for Feature Importance
# Extract importance values
importance_values <- importance(rf_model)
importance_df <- data.frame(
Feature = rownames(importance_values),
MeanDecreaseAccuracy = importance_values[, "MeanDecreaseAccuracy"],
MeanDecreaseGini = importance_values[, "MeanDecreaseGini"]
)
# Plot Mean Decrease in Accuracy
accuracy_plot <- ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseAccuracy), y = MeanDecreaseAccuracy)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() +
labs(
title = "Feature Importance (Mean Decrease in Accuracy)",
x = "Features",
y = "Importance"
) +
theme_minimal() +
theme(text = element_text(size = 12), axis.text.y = element_text(size = 10))
# Plot the accuracy plot
print(accuracy_plot)
# Save the plot with larger dimensions
#ggsave("feature_importance_accuracy_large.png", plot = accuracy_plot, width = 12, height = 10, dpi = 300)
# Plot Mean Decrease in Gini
gini_plot <- ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "lightgreen") +
coord_flip() +
labs(
title = "Feature Importance (Mean Decrease in Gini)",
x = "Features",
y = "Importance"
) +
theme_minimal() +
theme(text = element_text(size = 12), axis.text.y = element_text(size = 10))
# Plot the gini index plot
print(gini_plot)
# Save the plot with larger dimensions
#ggsave("feature_importance_gini_large.png", plot = gini_plot, width = 12, height = 10, dpi = 300)
# Convert target variable to a factor
y <- as.factor(data_updated$Installs_Category)
# Remove unused columns
X <- data_updated[, !names(data_updated) %in% c('Installs', 'Installs_Category', 'Installs_Num')]
# Split data into training and testing sets
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.75, list = FALSE)
X_train <- X[trainIndex, ]
X_test <- X[-trainIndex, ]
y_train <- y[trainIndex]
y_test <- y[-trainIndex]
library(ggplot2)
pca <- prcomp(X_train, center = TRUE, scale. = TRUE)
pca_data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], Class = y_train)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Class)) +
geom_point(size = 3) +
labs(title = "PCA: Decision Boundary Visualization")
The PCA projection does not make it clear whether the class boundary is linear or non-linear, so we fit two models, a linear SVM and a non-linear (radial) SVM, and check which one is the better fit.
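The difference between the two SVM variants comes down to the kernel used to compare pairs of observations. The sketch below assumes the radial kernel takes the form exp(-sigma * ||x - y||^2), which is how the kernlab package behind caret's "svmRadial" method defines it; the vectors `a` and `b` are toy inputs.

```r
# Linear kernel: a plain dot product.
linear_kernel <- function(x, y) sum(x * y)

# Radial (RBF) kernel, assumed form: exp(-sigma * squared Euclidean distance).
rbf_kernel <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))

a <- c(1, 0)
b <- c(0, 1)

k_lin       <- linear_kernel(a, b)            # orthogonal vectors
k_rbf_same  <- rbf_kernel(a, a, sigma = 0.5)  # identical points
k_rbf       <- rbf_kernel(a, b, sigma = 0.5)
k_rbf_sharp <- rbf_kernel(a, b, sigma = 1)    # larger sigma, faster decay

print(c(linear = k_lin, rbf_same = k_rbf_same,
        rbf_0.5 = k_rbf, rbf_1 = k_rbf_sharp))
```

A larger sigma makes similarity fall off faster with distance, which produces a more flexible and more overfitting-prone boundary; this is why sigma is tuned together with C.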
# Load necessary libraries
library(e1071)
library(caret)
# Assuming you have already defined X_train, y_train, X_test, y_test
# Combine the training data into a data frame
train_data <- as.data.frame(cbind(X_train, y_train))
# Set up k-fold cross-validation
set.seed(42)
train_control <- trainControl(method = "cv", number = 5)
# Define the tuning grid for 'C' and 'sigma' (gamma)
tune_grid <- expand.grid(C = c( 0.1, 1, 10, 100),
sigma = c(0.5, 1))
# Train the SVM model using radial kernel with the tuning grid
svm_model <- train(y_train ~ ., data = train_data,
method = "svmRadial",
tuneGrid = tune_grid,
trControl = train_control)
# Print the results of the tuning
print(svm_model)
## Support Vector Machines with Radial Basis Function Kernel
##
## 7233 samples
## 40 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5786, 5787, 5786, 5787, 5786
## Resampling results across tuning parameters:
##
## C sigma Accuracy Kappa
## 0.1 0.5 0.8472313 0.6830532
## 0.1 1.0 0.8110081 0.5974862
## 1.0 0.5 0.8758494 0.7486983
## 1.0 1.0 0.8678302 0.7306049
## 10.0 0.5 0.8721155 0.7392163
## 10.0 1.0 0.8623001 0.7164166
## 100.0 0.5 0.8642357 0.7199747
## 100.0 1.0 0.8533130 0.6950222
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.5 and C = 1.
# Best model parameters
best_params <- svm_model$bestTune
cat("Best Parameters:\n")
## Best Parameters:
print(best_params)
## sigma C
## 3 0.5 1
As the cross-validation results show, the radial SVM achieves its best accuracy (about 0.876) at C = 1 and sigma = 0.5.
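The selection rule that caret reports ("largest value" of Accuracy) can be reproduced from the resampling table. In this sketch the `cv_results` data frame is typed in by hand from the printed output; it is not extracted from the fitted object.

```r
# Cross-validated results copied from the radial SVM output above.
cv_results <- data.frame(
  C        = c(0.1, 0.1, 1, 1, 10, 10, 100, 100),
  sigma    = rep(c(0.5, 1), times = 4),
  Accuracy = c(0.8472313, 0.8110081, 0.8758494, 0.8678302,
               0.8721155, 0.8623001, 0.8642357, 0.8533130)
)

# Pick the parameter combination with the highest mean CV accuracy.
best <- cv_results[which.max(cv_results$Accuracy), ]
print(best)
```

With a fitted caret model the same table is available as `svm_model$results`, so the lookup does not need to be typed by hand.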
# Load necessary libraries
library(e1071)
library(caret)
# Assuming you have already defined X_train, y_train, X_test, y_test
# Combine the training data into a data frame
train_data <- as.data.frame(cbind(X_train, y_train))
# Set up k-fold cross-validation
set.seed(42)
train_control <- trainControl(method = "cv", number = 5)
# Define the tuning grid for 'C' (the linear kernel has no sigma)
tune_grid <- expand.grid(C = c( 0.1, 1, 10, 100))
# Train the SVM model using linear kernel with the tuning grid
svm_model <- train(y_train ~ ., data = train_data,
method = "svmLinear",
tuneGrid = tune_grid,
trControl = train_control)
# Print the results of the tuning
print(svm_model)
## Support Vector Machines with Linear Kernel
##
## 7233 samples
## 40 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5786, 5787, 5786, 5787, 5786
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.1 0.8842829 0.7692133
## 1.0 0.8903658 0.7805455
## 10.0 0.8960346 0.7914651
## 100.0 0.9300447 0.8539968
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 100.
# Best model parameters
best_params <- svm_model$bestTune
cat("Best Parameters:\n")
## Best Parameters:
print(best_params)
## C
## 4 100
For the linear model, C = 100 gives a cross-validated accuracy of about 93 percent, higher than the radial kernel's best of about 88 percent, which suggests that the classes are close to linearly separable. Hence, let us now find the accuracy, ROC curve, and AUC score on the test data.
# Load necessary libraries
library(e1071)
library(pROC) # For ROC and AUC
# Assuming you have already defined X_train, y_train, X_test, y_test
# Combine the training data into a data frame
train_data <- as.data.frame(cbind(X_train, y_train))
# Fit the SVM model with linear kernel
svm_model <- svm(y_train ~ ., data = train_data, kernel = "linear", cost = 100, decision.values = TRUE)
# Step 1: Make predictions on the test set
predictions <- predict(svm_model, newdata = as.data.frame(X_test))
# Step 2: Create confusion matrix
confusion_matrix <- table(Predicted = predictions, Actual = y_test)
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(confusion_matrix)
## Actual
## Predicted 0 1
## 0 1389 101
## 1 47 874
# Step 3: Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.9386147
# Step 4: Get decision values for ROC curve
fitted <- attributes(predict(svm_model, newdata = as.data.frame(X_test), decision.values = TRUE))$decision.values
# Step 5: Generate ROC plot for the test set
roc_curve <- roc(y_test, -fitted) # Note: Use negative for class labeling
# Plot the ROC curve
plot(roc_curve, main = "ROC Curve for Test Data")
# Add AUC to the plot
auc_value <- auc(roc_curve)
legend("bottomright", legend = paste("AUC =", round(auc_value, 2)), bty = "n")
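The AUC has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it from ranks (the Mann-Whitney formulation) on toy data; `auc_rank` is a hypothetical helper and the `labels` and `scores` vectors are invented for illustration.

```r
# Toy labels and scores (illustrative only).
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.10, 0.35, 0.40, 0.60, 0.80, 0.90)

# AUC from ranks: sum of positive-class ranks, corrected and normalized.
auc_rank <- function(labels, scores) {
  r <- rank(scores)  # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_toy <- auc_rank(labels, scores)
cat("AUC:", round(auc_toy, 4), "\n")
```

Here 8 of the 9 positive/negative pairs are ordered correctly, which is why the result is 8/9.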
# Assuming you have already trained your SVM model (svm_model) using e1071
# Get coefficients from the SVM model
coefficients <- as.vector(svm_model$coefs) %*% svm_model$SV
# Get the offset term (e1071's decision function is w %*% x - rho,
# so rho is the negative of the usual intercept)
intercept <- svm_model$rho
# Combine coefficients and intercept into a single vector
all_coefficients <- c(intercept, coefficients)
# Print coefficients
cat("Coefficients (including intercept):\n")
## Coefficients (including intercept):
print(all_coefficients)
## [1] 2.085058e+01 1.087825e-01 -1.767870e+02 -1.390152e-01 1.299278e+01
## [6] 8.604962e-02 9.476142e-02 3.703294e-02 4.115168e-02 4.284333e-05
## [11] 4.955957e-03 4.937072e-02 2.413850e-02 -3.674008e-02 -5.317158e-02
## [16] -7.816514e-02 -4.658387e-02 3.927985e-02 1.278253e-01 5.874558e-02
## [21] 3.353022e-02 3.754532e-02 -1.142397e-01 -7.283130e-02 -3.366334e-02
## [26] 1.347114e-02 3.575670e-03 1.449530e-01 6.237576e-02 1.882492e-02
## [31] -1.224170e-01 -1.774672e-01 -3.534363e-02 -7.687325e-02 -6.545884e-02
## [36] 5.654285e-02 5.225918e-02 -8.242140e-02 -1.134773e-01 -1.478470e-02
## [41] -9.333093e-01
# Check the number of coefficients
num_coefficients <- length(all_coefficients)
cat("Number of Coefficients (including intercept):", num_coefficients, "\n")
## Number of Coefficients (including intercept): 41
# Get feature names
feature_names <- colnames(X_train)
# Create a named vector for coefficients with feature names
named_coefficients <- setNames(coefficients, feature_names)
# Print named coefficients
cat("Feature Coefficients:\n")
## Feature Coefficients:
print(named_coefficients)
## Rating Reviews Size Price content_rating lastupdate
## [1,] 0.1087825 -176.787 -0.1390152 12.99278 0.08604962 0.09476142
## catART_AND_DESIGN catAUTO_AND_VEHICLES catBEAUTY catBOOKS_AND_REFERENCE
## [1,] 0.03703294 0.04115168 4.284333e-05 0.004955957
## catBUSINESS catCOMICS catCOMMUNICATION catDATING catEDUCATION
## [1,] 0.04937072 0.0241385 -0.03674008 -0.05317158 -0.07816514
## catENTERTAINMENT catEVENTS catFAMILY catFINANCE catFOOD_AND_DRINK
## [1,] -0.04658387 0.03927985 0.1278253 0.05874558 0.03353022
## catGAME catHEALTH_AND_FITNESS catHOUSE_AND_HOME catLIBRARIES_AND_DEMO
## [1,] 0.03754532 -0.1142397 -0.0728313 -0.03366334
## catLIFESTYLE catMAPS_AND_NAVIGATION catMEDICAL catNEWS_AND_MAGAZINES
## [1,] 0.01347114 0.00357567 0.144953 0.06237576
## catPARENTING catPERSONALIZATION catPHOTOGRAPHY catPRODUCTIVITY catSHOPPING
## [1,] 0.01882492 -0.122417 -0.1774672 -0.03534363 -0.07687325
## catSOCIAL catSPORTS catTOOLS catTRAVEL_AND_LOCAL catVIDEO_PLAYERS
## [1,] -0.06545884 0.05654285 0.05225918 -0.0824214 -0.1134773
## catWEATHER Success
## [1,] -0.0147847 -0.9333093
## attr(,"names")
## [1] "Rating" "Reviews" "Size"
## [4] "Price" "content_rating" "lastupdate"
## [7] "catART_AND_DESIGN" "catAUTO_AND_VEHICLES" "catBEAUTY"
## [10] "catBOOKS_AND_REFERENCE" "catBUSINESS" "catCOMICS"
## [13] "catCOMMUNICATION" "catDATING" "catEDUCATION"
## [16] "catENTERTAINMENT" "catEVENTS" "catFAMILY"
## [19] "catFINANCE" "catFOOD_AND_DRINK" "catGAME"
## [22] "catHEALTH_AND_FITNESS" "catHOUSE_AND_HOME" "catLIBRARIES_AND_DEMO"
## [25] "catLIFESTYLE" "catMAPS_AND_NAVIGATION" "catMEDICAL"
## [28] "catNEWS_AND_MAGAZINES" "catPARENTING" "catPERSONALIZATION"
## [31] "catPHOTOGRAPHY" "catPRODUCTIVITY" "catSHOPPING"
## [34] "catSOCIAL" "catSPORTS" "catTOOLS"
## [37] "catTRAVEL_AND_LOCAL" "catVIDEO_PLAYERS" "catWEATHER"
## [40] "Success"
# Sort coefficients by absolute value for feature importance
sorted_coefficients <- sort(abs(named_coefficients), decreasing = TRUE)
# Print sorted feature importance
cat("Sorted Feature Importance:\n")
## Sorted Feature Importance:
print(sorted_coefficients)
## Reviews Price Success
## 1.767870e+02 1.299278e+01 9.333093e-01
## catPHOTOGRAPHY catMEDICAL Size
## 1.774672e-01 1.449530e-01 1.390152e-01
## catFAMILY catPERSONALIZATION catHEALTH_AND_FITNESS
## 1.278253e-01 1.224170e-01 1.142397e-01
## catVIDEO_PLAYERS Rating lastupdate
## 1.134773e-01 1.087825e-01 9.476142e-02
## content_rating catTRAVEL_AND_LOCAL catEDUCATION
## 8.604962e-02 8.242140e-02 7.816514e-02
## catSHOPPING catHOUSE_AND_HOME catSOCIAL
## 7.687325e-02 7.283130e-02 6.545884e-02
## catNEWS_AND_MAGAZINES catFINANCE catSPORTS
## 6.237576e-02 5.874558e-02 5.654285e-02
## catDATING catTOOLS catBUSINESS
## 5.317158e-02 5.225918e-02 4.937072e-02
## catENTERTAINMENT catAUTO_AND_VEHICLES catEVENTS
## 4.658387e-02 4.115168e-02 3.927985e-02
## catGAME catART_AND_DESIGN catCOMMUNICATION
## 3.754532e-02 3.703294e-02 3.674008e-02
## catPRODUCTIVITY catLIBRARIES_AND_DEMO catFOOD_AND_DRINK
## 3.534363e-02 3.366334e-02 3.353022e-02
## catCOMICS catPARENTING catWEATHER
## 2.413850e-02 1.882492e-02 1.478470e-02
## catLIFESTYLE catBOOKS_AND_REFERENCE catMAPS_AND_NAVIGATION
## 1.347114e-02 4.955957e-03 3.575670e-03
## catBEAUTY
## 4.284333e-05
# Optional: Visualize feature importance
barplot(
sorted_coefficients,
main = "Feature Importance from SVM Coefficients",
xlab = "Features",
ylab = "Absolute Coefficient Value",
col = "steelblue",
las = 2,
cex.names = 0.7 # Adjust name size if necessary
)
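To see how such coefficients turn into a prediction, the sketch below evaluates a linear decision function of the form w %*% x - rho, the form used by e1071. The weights, offset, and feature values are hypothetical numbers chosen for illustration, not the fitted values above. Note also that `svm()` scales the predictors by default, so the fitted coefficients apply to the scaled features.

```r
# Hypothetical weights and offset for a two-feature linear SVM.
w   <- c(Reviews = 2.0, Price = -0.5)
rho <- 0.25

# Hypothetical, already-scaled feature values for two apps.
x_scaled <- rbind(app_a = c(1.2, 0.4),
                  app_b = c(-0.3, 1.0))
colnames(x_scaled) <- names(w)

# Decision value: signed distance-like score relative to the hyperplane.
decision <- as.vector(x_scaled %*% w) - rho
side <- ifelse(decision > 0, "positive side", "negative side")

print(data.frame(app = rownames(x_scaled), decision, side))
```

Which class corresponds to the positive side depends on how the fitted model orders its labels, so the sign should be checked against known predictions before interpreting it.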